Founded in 2000 by a high school teacher in the Bronx, DonorsChoose.org empowers public school teachers from across the country to request much-needed materials and experiences for their students. At any given time, there are thousands of classroom requests that can be brought to life with a gift of any amount.
DonorsChoose.org receives hundreds of thousands of project proposals each year for classroom projects in need of funding. Right now, a large number of volunteers is needed to manually screen each submission before it can be posted on the DonorsChoose.org website.
Next year, DonorsChoose.org expects to receive close to 500,000 project proposals. As a result, there are three main problems they need to solve:
The goal of the competition is to predict whether or not a DonorsChoose.org project proposal submitted by a teacher will be approved, using the text of project descriptions as well as additional metadata about the project, teacher, and school. DonorsChoose.org can then use this information to identify projects most likely to need further review before approval.
The train.csv data set provided by DonorsChoose contains the following features:
| Feature | Description |
|---|---|
| project_id | A unique identifier for the proposed project. Example: p036502 |
| project_title | Title of the project. |
| project_grade_category | Grade level of students for which the project is targeted; one of a small set of enumerated values. |
| project_subject_categories | One or more (comma-separated) subject categories for the project, drawn from an enumerated list of values. |
| school_state | State where the school is located (two-letter U.S. postal code). Example: WY |
| project_subject_subcategories | One or more (comma-separated) subject subcategories for the project. |
| project_resource_summary | An explanation of the resources needed for the project. |
| project_essay_1 | First application essay* |
| project_essay_2 | Second application essay* |
| project_essay_3 | Third application essay* |
| project_essay_4 | Fourth application essay* |
| project_submitted_datetime | Datetime when the project application was submitted. Example: 2016-04-28 12:43:56.245 |
| teacher_id | A unique identifier for the teacher of the proposed project. Example: bdf8baa8fedef6bfeec7ae4ff1c15c56 |
| teacher_prefix | Teacher's title; one of a small set of enumerated values. |
| teacher_number_of_previously_posted_projects | Number of project applications previously submitted by the same teacher. Example: 2 |
* See the section Notes on the Essay Data for more details about these features.
Additionally, the resources.csv data set provides more data about the resources required for each project. Each line in this file represents a resource required by a project:
| Feature | Description |
|---|---|
| id | A project_id value from the train.csv file. Example: p036502 |
| description | Description of the resource. Example: Tenor Saxophone Reeds, Box of 25 |
| quantity | Quantity of the resource required. Example: 3 |
| price | Price of the resource required. Example: 9.95 |
Note: Many projects require multiple resources. The id value corresponds to a project_id in train.csv, so it can be used as a key to retrieve all resources needed for a project.
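As a sketch of that lookup, pandas' groupby/get_group can collect every resource row for one project. The rows below are hypothetical stand-ins mirroring the resources.csv columns:

```python
import pandas as pd

# Hypothetical resource rows for two projects (column names match resources.csv).
resources = pd.DataFrame({
    'id': ['p036502', 'p036502', 'p185307'],
    'description': ['Tenor Saxophone Reeds, Box of 25', 'Music Stand', 'Chromebook'],
    'quantity': [3, 1, 5],
    'price': [9.95, 25.00, 199.99],
})

# Group once, then look up all resources required by a given project_id.
resources_by_project = resources.groupby('id')
project_resources = resources_by_project.get_group('p036502')
print(project_resources[['description', 'quantity', 'price']])
```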
The data set contains the following label (the value you will attempt to predict):
| Label | Description |
|---|---|
| project_is_approved | A binary flag indicating whether DonorsChoose approved the project: 0 means the project was not approved, 1 means it was approved. |
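Because approvals dominate this label (the split later in the notebook counts far more positive than negative points, and resampling utilities are imported to compensate), a quick class-balance check is a sensible first step. The labels below are toy stand-ins for the real project_is_approved column:

```python
import pandas as pd

# Toy labels standing in for the real project_is_approved column.
labels = pd.Series([1, 1, 1, 1, 0, 1, 1, 0, 1, 1], name='project_is_approved')

# Class counts and approval rate; a strong imbalance like this one matters
# when choosing evaluation metrics (e.g. AUC rather than plain accuracy).
counts = labels.value_counts()
approval_rate = labels.mean()
print(counts)
print(f"Approval rate: {approval_rate:.2f}")
```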
# numpy for easy numerical computations
import numpy as np
# pandas for dataframes and filterings
import pandas as pd
# sqlite3 library for performing operations on sqlite file
import sqlite3
# matplotlib for plotting graphs
import matplotlib.pyplot as plt
# seaborn library for easy plotting
import seaborn as sbrn
# warnings library for specific settings
import warnings
# re module for regex operations
import re
# For loading precomputed models
import pickle
# For loading natural language processing tool-kit
import nltk
# For calculating mathematical terms
import math
# For plotting 3-D Plot
import plotly.offline as offline
import plotly.graph_objs as go
offline.init_notebook_mode()
# For loading files from google drive
from google.colab import drive
# For working with files in google drive
drive.mount('/content/drive')
# tqdm for tracking progress of loops
from tqdm import tqdm_notebook as tqdm
# For creating dictionary of words
from collections import Counter
# For creating BagOfWords Model
from sklearn.feature_extraction.text import CountVectorizer
# For creating TfidfModel
from sklearn.feature_extraction.text import TfidfVectorizer
# For standardizing values
from sklearn.preprocessing import StandardScaler
# For stacking sparse matrices horizontally (side by side, concatenating columns)
from scipy.sparse import hstack
# For stacking sparse matrices vertically (one on top of another, concatenating rows)
from scipy.sparse import vstack
# For converting dataframes into sparse matrix
from scipy.sparse import csr_matrix
# For calculating TSNE values
from sklearn.manifold import TSNE
# For calculating the accuracy score on cross-validation data
from sklearn.metrics import accuracy_score
# For performing the k-fold cross validation
from sklearn.model_selection import cross_val_score
# For splitting the data set into test and train data
from sklearn import model_selection
# For using decision tree classifier
from sklearn import tree
# For generating word cloud
from wordcloud import WordCloud
# For using random forest classifier
from sklearn.ensemble import RandomForestClassifier
# For using gradient boosting classifier
import xgboost as xgb
# For plotting decision tree
import graphviz
# For using a linear SVM classifier (hinge loss trained with SGD)
from sklearn import linear_model
# For resampling, used to balance the dataset
from sklearn.utils import resample
# For shuffling the dataframes
from sklearn.utils import shuffle
# For calculating roc_curve parameters
from sklearn.metrics import roc_curve
# For calculating auc value
from sklearn.metrics import auc
# For displaying results in table format
from prettytable import PrettyTable
# For generating confusion matrix
from sklearn.metrics import confusion_matrix
# For using gridsearch cv to find best parameter
from sklearn.model_selection import GridSearchCV
# For using randomized search cross validation
from sklearn.model_selection import RandomizedSearchCV
# For performing min-max scaling of features
from sklearn.preprocessing import MinMaxScaler
# For calculating sentiment score of the text
from nltk.sentiment.vader import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')
warnings.filterwarnings('ignore')
projectsData = pd.read_csv('drive/My Drive/train_data.csv');
resourcesData = pd.read_csv('drive/My Drive/resources.csv');
projectsData.head(3)
projectsData.tail(3)
resourcesData.head(3)
resourcesData.tail(3)
def equalsBorder(numberOfEqualSigns):
"""
Prints a border line made of the given number of equal signs.
"""
print("=" * numberOfEqualSigns)
# Citation link: https://stackoverflow.com/questions/8924173/how-do-i-print-bold-text-in-python
class color:
PURPLE = '\033[95m'
CYAN = '\033[96m'
DARKCYAN = '\033[36m'
BLUE = '\033[94m'
GREEN = '\033[92m'
YELLOW = '\033[93m'
RED = '\033[91m'
BOLD = '\033[1m'
UNDERLINE = '\033[4m'
END = '\033[0m'
def printStyle(text, style):
"This function prints text with the style passed to it"
print(style + text + color.END);
printStyle("Number of data points in projects data: {}".format(projectsData.shape[0]), color.BOLD);
printStyle("Number of attributes in projects data:{}".format(projectsData.shape[1]), color.BOLD);
equalsBorder(60);
printStyle("Number of data points in resources data: {}".format(resourcesData.shape[0]), color.BOLD);
printStyle("Number of attributes in resources data: {}".format(resourcesData.shape[1]), color.BOLD);
# remove special characters from list of strings python: https://stackoverflow.com/a/47301924/4084039
# https://www.geeksforgeeks.org/removing-stop-words-nltk-python/
# https://stackoverflow.com/questions/23669024/how-to-strip-a-specific-word-from-a-string
# https://stackoverflow.com/questions/8270092/remove-all-whitespace-in-a-string-in-python
def cleanCategories(subjectCategories):
cleanedCategories = []
for subjectCategory in tqdm(subjectCategories):
tempCategory = ""
for category in subjectCategory.split(","):
if 'The' in category.split(): # split the category on spaces, e.g. "The Arts" => ["The", "Arts"]
category = category.replace('The','') # remove the word 'The'
category = category.replace(' ','') # remove all spaces, e.g. "Math & Science" => "Math&Science"
tempCategory += category.strip()+" " # strip() removes any leading/trailing spaces
tempCategory = tempCategory.replace('&','_')
cleanedCategories.append(tempCategory)
return cleanedCategories
# projectDataWithCleanedCategories = pd.DataFrame(projectsData);
subjectCategories = list(projectsData.project_subject_categories);
cleanedCategories = cleanCategories(subjectCategories);
printStyle("Sample categories: ", color.BOLD);
equalsBorder(60);
print(subjectCategories[0:5]);
equalsBorder(60);
printStyle("Sample cleaned categories: ", color.BOLD);
equalsBorder(60);
print(cleanedCategories[0:5]);
projectsData['cleaned_categories'] = cleanedCategories;
projectsData.head(5)
# count of all the words in corpus python: https://stackoverflow.com/a/22898595/4084039
categoriesCounter = Counter()
for subjectCategory in projectsData.cleaned_categories.values:
categoriesCounter.update(subjectCategory.split());
categoriesCounter
# count of all the words in corpus python: https://stackoverflow.com/a/22898595/4084039
categoriesDictionary = dict(categoriesCounter);
sortedCategoriesDictionary = dict(sorted(categoriesDictionary.items(), key = lambda keyValue: keyValue[1]));
sortedCategoriesData = pd.DataFrame.from_dict(sortedCategoriesDictionary, orient='index');
sortedCategoriesData.columns = ['subject_categories'];
printStyle("Number of projects by Subject Categories: ", color.BOLD);
equalsBorder(60);
sortedCategoriesData
subjectSubCategories = projectsData.project_subject_subcategories;
cleanedSubCategories = cleanCategories(subjectSubCategories);
printStyle("Sample subject sub categories: ", color.BOLD);
equalsBorder(70);
print(subjectSubCategories[0:5]);
equalsBorder(70);
printStyle("Sample cleaned subject sub categories: ", color.BOLD);
equalsBorder(70);
print(cleanedSubCategories[0:5]);
projectsData['cleaned_sub_categories'] = cleanedSubCategories;
# count of all the words in corpus python: https://stackoverflow.com/a/22898595/4084039
subjectsSubCategoriesCounter = Counter();
for subCategory in projectsData.cleaned_sub_categories:
subjectsSubCategoriesCounter.update(subCategory.split());
subjectsSubCategoriesCounter
# dict sort by value python: https://stackoverflow.com/a/613218/4084039
dictionarySubCategories = dict(subjectsSubCategoriesCounter);
sortedDictionarySubCategories = dict(sorted(dictionarySubCategories.items(), key = lambda keyValue: keyValue[1]));
sortedSubCategoriesData = pd.DataFrame.from_dict(sortedDictionarySubCategories, orient = 'index');
sortedSubCategoriesData.columns = ['subject_sub_categories']
printStyle("Number of projects sorted by subject sub categories: ", color.BOLD);
equalsBorder(70);
sortedSubCategoriesData
# Concatenate the four application essays; fillna('') prevents missing essays from
# becoming the literal string 'nan', and the joining spaces keep the last word of
# one essay from fusing with the first word of the next.
projectsData['project_essay'] = projectsData['project_essay_1'].fillna('') + ' ' + \
projectsData['project_essay_2'].fillna('') + ' ' + \
projectsData['project_essay_3'].fillna('') + ' ' + \
projectsData['project_essay_4'].fillna('');
projectsData.head(5)
priceAndQuantityData = resourcesData.groupby('id').agg({'price': 'sum', 'quantity': 'sum'}).reset_index();
priceAndQuantityData.head(5)
projectsData.shape
projectsData = pd.merge(projectsData, priceAndQuantityData, on = 'id', how = 'left');
print(projectsData.shape);
projectsData.head(3)
projectsData[projectsData['id'] == 'p253737']
priceAndQuantityData[priceAndQuantityData['id'] == 'p253737']
# https://gist.github.com/sebleier/554280
# Standard English stop-word list, with the negations 'no', 'nor', and 'not'
# deliberately left out, since negations carry signal useful for this task.
stopWords= set(['br', 'the', 'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've",\
"you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', \
'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their',\
'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', \
'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', \
'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', \
'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after',\
'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further',\
'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',\
'most', 'other', 'some', 'such', 'only', 'own', 'same', 'so', 'than', 'too', 'very', \
's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', \
've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn',\
"hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn',\
"mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", \
'won', "won't", 'wouldn', "wouldn't"]);
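A minimal sketch of how such a list can be built programmatically rather than pasted by hand: start from a standard English stop-word list (a small hand-written set stands in here for nltk's stopwords.words('english')) and subtract the negations, which are kept in the text because they carry predictive signal:

```python
# Small stand-in for a full standard English stop-word list.
standard_stopwords = {'the', 'a', 'an', 'is', 'of', 'and', 'no', 'nor', 'not'}

# Negation words to keep in the text because they affect meaning.
negations = {'no', 'nor', 'not'}

# Set difference gives the final stop-word list used for filtering.
stop_words = standard_stopwords - negations
print(sorted(stop_words))
```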
def preProcessingWithAndWithoutStopWords(texts):
"""
This function takes list of texts and returns preprocessed list of texts one with
stop words and one without stopwords.
"""
# Variable for storing preprocessed text with stop words
preProcessedTextsWithStopWords = [];
# Variable for storing preprocessed text without stop words
preProcessedTextsWithoutStopWords = [];
# Looping over list of texts for performing pre processing
for text in tqdm(texts, total = len(texts)):
# Removing all links in the text
text = re.sub(r"http\S+", "", text);
# Removing all html tags in the text
text = re.sub(r"<\w+/>", "", text);
text = re.sub(r"<\w+>", "", text);
# https://stackoverflow.com/a/47091490/4084039
# Expanding common English contractions into their full forms
text = re.sub(r"won't", "will not", text)
text = re.sub(r"can\'t", "can not", text)
text = re.sub(r"n\'t", " not", text)
text = re.sub(r"\'re", " are", text)
text = re.sub(r"\'s", " is", text)
text = re.sub(r"\'d", " would", text)
text = re.sub(r"\'ll", " will", text)
text = re.sub(r"\'t", " not", text)
text = re.sub(r"\'ve", " have", text)
text = re.sub(r"\'m", " am", text)
# Removing backslash symbols in text
text = text.replace('\\r', ' ');
text = text.replace('\\n', ' ');
text = text.replace('\\"', ' ');
# Removing all special characters of text
text = re.sub(r"[^a-zA-Z0-9]+", " ", text);
# Converting whole review text into lower case
text = text.lower();
# adding this preprocessed text without stopwords to list
preProcessedTextsWithStopWords.append(text);
# removing stop words from text
textWithoutStopWords = ' '.join([word for word in text.split() if word not in stopWords]);
# adding this preprocessed text without stopwords to list
preProcessedTextsWithoutStopWords.append(textWithoutStopWords);
return [preProcessedTextsWithStopWords, preProcessedTextsWithoutStopWords];
texts = [projectsData['project_essay'].values[0]]
preProcessedTextsWithStopWords, preProcessedTextsWithoutStopWords = preProcessingWithAndWithoutStopWords(texts);
print("Example project essay without pre-processing: ");
equalsBorder(70);
print(texts);
equalsBorder(70);
print("Example project essay with stop words and pre-processing: ");
equalsBorder(70);
print(preProcessedTextsWithStopWords);
equalsBorder(70);
print("Example project essay without stop words and pre-processing: ");
equalsBorder(70);
print(preProcessedTextsWithoutStopWords);
projectEssays = projectsData['project_essay'];
preProcessedEssaysWithStopWords, preProcessedEssaysWithoutStopWords = preProcessingWithAndWithoutStopWords(projectEssays);
preProcessedEssaysWithoutStopWords[0:3]
projectTitles = projectsData['project_title'];
preProcessedProjectTitlesWithStopWords, preProcessedProjectTitlesWithoutStopWords = preProcessingWithAndWithoutStopWords(projectTitles);
preProcessedProjectTitlesWithoutStopWords[0:5]
projectsData['preprocessed_titles'] = preProcessedProjectTitlesWithoutStopWords;
projectsData['preprocessed_essays'] = preProcessedEssaysWithoutStopWords;
projectsData.shape
pd.DataFrame(projectsData.columns, columns = ['All features in projects data'])
projectsData = projectsData.dropna(subset = ['teacher_prefix']);
projectsData.shape
classesData = projectsData['project_is_approved']
print(classesData.shape)
trainingData, testData, classesTraining, classesTest = model_selection.train_test_split(projectsData, classesData, test_size = 0.3, random_state = 44, stratify = classesData);
trainingData, crossValidateData, classesTraining, classesCrossValidate = model_selection.train_test_split(trainingData, classesTraining, test_size = 0.3, random_state = 0, stratify = classesTraining);
print("Shapes of the split data: ");
equalsBorder(70);
print("testData shape: ", testData.shape);
print("classesTest: ", classesTest.shape);
print("trainingData shape: ", trainingData.shape);
print("classesTraining shape: ", classesTraining.shape);
print("Number of negative points: ", trainingData[trainingData['project_is_approved'] == 0].shape[0]);
print("Number of positive points: ", trainingData[trainingData['project_is_approved'] == 1].shape[0]);
vectorizedFeatureNames = [];
# Categorizing subjects categories feature using response encoding
subjectsCategoriesResponseData = [dict(), dict()];
for index, dataPoint in trainingData.iterrows():
subjectCategory = dataPoint['cleaned_categories'];
classValue = dataPoint['project_is_approved'];
if(subjectCategory in subjectsCategoriesResponseData[classValue]):
subjectsCategoriesResponseData[classValue][subjectCategory] += 1;
else:
subjectsCategoriesResponseData[classValue][subjectCategory] = 1;
allSubjectCategories = set(list(subjectsCategoriesResponseData[0].keys()) + list(subjectsCategoriesResponseData[1].keys()));
for subjectCategory in allSubjectCategories:
if(subjectCategory not in subjectsCategoriesResponseData[0]):
subjectsCategoriesResponseData[0][subjectCategory] = 0;
if(subjectCategory not in subjectsCategoriesResponseData[1]):
subjectsCategoriesResponseData[1][subjectCategory] = 0;
def subjectsCategoriesTransform(subjectCategories):
transformedData = pd.DataFrame(columns = ['SubjectsCategories0', 'SubjectsCategories1']);
numRows = len(subjectCategories);
for index, subjectCategory in enumerate(tqdm(subjectCategories)):
if subjectCategory in allSubjectCategories:
class0Value = subjectsCategoriesResponseData[0][subjectCategory];
class1Value = subjectsCategoriesResponseData[1][subjectCategory];
totalValue = class0Value + class1Value;
class0Value = class0Value / totalValue;
class1Value = class1Value / totalValue;
transformedData.loc[index] = [class0Value, class1Value];
else:
transformedData.loc[index] = [0.5, 0.5];
return csr_matrix(transformedData);
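The row-by-row `.loc[index]` transform above can be slow on hundreds of thousands of rows. As an alternative sketch (with toy data standing in for trainingData), the same response encoding can be computed with one groupby and a vectorized map, falling back to the neutral 0.5/0.5 prior for categories unseen in training:

```python
import pandas as pd

# Toy training frame; in the notebook these would be trainingData's
# 'cleaned_categories' and 'project_is_approved' columns.
train = pd.DataFrame({
    'cleaned_categories': ['Math_Science', 'Literacy', 'Math_Science', 'Literacy', 'Music'],
    'project_is_approved': [1, 0, 1, 1, 0],
})

# One groupby gives P(class=1 | category); P(class=0) is its complement.
rate1 = train.groupby('cleaned_categories')['project_is_approved'].mean()

def response_encode(categories, rate1, default=0.5):
    """Map each category to [P(class0), P(class1)], using a neutral prior
    for categories that were never seen during training."""
    p1 = categories.map(rate1).fillna(default)
    return pd.DataFrame({'class0': 1 - p1, 'class1': p1})

encoded = response_encode(pd.Series(['Math_Science', 'UnseenCategory']), rate1)
print(encoded)
```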
categoriesVector = subjectsCategoriesTransform(trainingData['cleaned_categories'].values);
print("Features used in vectorizing subject categories: ");
equalsBorder(70);
print(list(allSubjectCategories));
equalsBorder(70);
print("Shape of cleaned_categories matrix after vectorization (response encoding): ", categoriesVector.shape);
equalsBorder(70);
print("Sample vectors of categories: ");
equalsBorder(70);
print(categoriesVector[0:4])
# Categorizing subjects sub categories feature using response encoding
subjectsSubCategoriesResponseData = [dict(), dict()];
for index, dataPoint in trainingData.iterrows():
subjectSubCategory = dataPoint['cleaned_sub_categories'];
classValue = dataPoint['project_is_approved'];
if(subjectSubCategory in subjectsSubCategoriesResponseData[classValue]):
subjectsSubCategoriesResponseData[classValue][subjectSubCategory] += 1;
else:
subjectsSubCategoriesResponseData[classValue][subjectSubCategory] = 1;
allSubjectSubCategories = set(list(subjectsSubCategoriesResponseData[0].keys()) + list(subjectsSubCategoriesResponseData[1].keys()));
for subjectSubCategory in allSubjectSubCategories:
if(subjectSubCategory not in subjectsSubCategoriesResponseData[0]):
subjectsSubCategoriesResponseData[0][subjectSubCategory] = 0;
if(subjectSubCategory not in subjectsSubCategoriesResponseData[1]):
subjectsSubCategoriesResponseData[1][subjectSubCategory] = 0;
def subjectsSubCategoriesTransform(subjectSubCategories):
transformedData = pd.DataFrame(columns = ['SubjectsSubCategories0', 'SubjectsSubCategories1']);
numRows = len(subjectSubCategories);
for index, subjectSubCategory in enumerate(tqdm(subjectSubCategories)):
if subjectSubCategory in allSubjectSubCategories:
class0Value = subjectsSubCategoriesResponseData[0][subjectSubCategory];
class1Value = subjectsSubCategoriesResponseData[1][subjectSubCategory];
totalValue = class0Value + class1Value;
class0Value = class0Value / totalValue;
class1Value = class1Value / totalValue;
transformedData.loc[index] = [class0Value, class1Value];
else:
transformedData.loc[index] = [0.5, 0.5];
return csr_matrix(transformedData);
subCategoriesVectors = subjectsSubCategoriesTransform(trainingData['cleaned_sub_categories'].values);
print("Features used in vectorizing subject sub categories: ");
equalsBorder(70);
print(list(allSubjectSubCategories));
equalsBorder(70);
print("Shape of cleaned_sub_categories matrix after vectorization (response encoding): ", subCategoriesVectors.shape);
equalsBorder(70);
print("Sample vectors of sub categories: ");
equalsBorder(70);
print(subCategoriesVectors[0:4])
def giveCounter(data):
counter = Counter();
for dataValue in data:
counter.update(str(dataValue).split());
return counter
giveCounter(trainingData['teacher_prefix'].values)
# Categorizing teacher prefixes feature using response encoding
teacherPrefixResponseData = [dict(), dict()];
for index, dataPoint in trainingData.iterrows():
teacherPrefix = dataPoint['teacher_prefix'];
classValue = dataPoint['project_is_approved'];
if(teacherPrefix in teacherPrefixResponseData[classValue]):
teacherPrefixResponseData[classValue][teacherPrefix] += 1;
else:
teacherPrefixResponseData[classValue][teacherPrefix] = 1;
allTeacherPrefixes = set(list(teacherPrefixResponseData[0].keys()) + list(teacherPrefixResponseData[1].keys()));
for teacherPrefix in allTeacherPrefixes:
if(teacherPrefix not in teacherPrefixResponseData[0]):
teacherPrefixResponseData[0][teacherPrefix] = 0;
if(teacherPrefix not in teacherPrefixResponseData[1]):
teacherPrefixResponseData[1][teacherPrefix] = 0;
def teacherPrefixTransform(teacherPrefixes):
transformedData = pd.DataFrame(columns = ['teacherPrefixes0', 'teacherPrefixes1']);
numRows = len(teacherPrefixes);
for index, teacherPrefix in enumerate(tqdm(teacherPrefixes)):
if teacherPrefix in allTeacherPrefixes:
class0Value = teacherPrefixResponseData[0][teacherPrefix];
class1Value = teacherPrefixResponseData[1][teacherPrefix];
totalValue = class0Value + class1Value;
class0Value = class0Value / totalValue;
class1Value = class1Value / totalValue;
transformedData.loc[index] = [class0Value, class1Value];
else:
transformedData.loc[index] = [0.5, 0.5];
return csr_matrix(transformedData);
teacherPrefixVectors = teacherPrefixTransform(trainingData['teacher_prefix'].values);
print("Features used in vectorizing teacher prefixes: ");
equalsBorder(70);
print(list(allTeacherPrefixes));
equalsBorder(70);
print("Shape of teacher prefixes matrix after vectorization (response encoding): ", teacherPrefixVectors.shape);
equalsBorder(70);
print("Sample vectors of teacher prefixes: ");
equalsBorder(70);
print(teacherPrefixVectors[0:4]);
# Categorizing school state feature using response encoding
schoolStateResponseData = [dict(), dict()];
for index, dataPoint in trainingData.iterrows():
schoolState = dataPoint['school_state'];
classValue = dataPoint['project_is_approved'];
if(schoolState in schoolStateResponseData[classValue]):
schoolStateResponseData[classValue][schoolState] += 1;
else:
schoolStateResponseData[classValue][schoolState] = 1;
allSchoolStates = set(list(schoolStateResponseData[0].keys()) + list(schoolStateResponseData[1].keys()));
for schoolState in allSchoolStates:
if(schoolState not in schoolStateResponseData[0]):
schoolStateResponseData[0][schoolState] = 0;
if(schoolState not in schoolStateResponseData[1]):
schoolStateResponseData[1][schoolState] = 0;
def schoolStateTransform(schoolStates):
transformedData = pd.DataFrame(columns = ['SchoolStates0', 'SchoolStates1']);
numRows = len(schoolStates);
for index, schoolState in enumerate(tqdm(schoolStates)):
if schoolState in allSchoolStates:
class0Value = schoolStateResponseData[0][schoolState];
class1Value = schoolStateResponseData[1][schoolState];
totalValue = class0Value + class1Value;
class0Value = class0Value / totalValue;
class1Value = class1Value / totalValue;
transformedData.loc[index] = [class0Value, class1Value];
else:
transformedData.loc[index] = [0.5, 0.5];
return csr_matrix(transformedData);
schoolStateVectors = schoolStateTransform(trainingData['school_state'].values);
print("Features used in vectorizing school states: ");
equalsBorder(70);
print(list(allSchoolStates));
equalsBorder(70);
print("Shape of school states matrix after vectorization (response encoding): ", schoolStateVectors.shape);
equalsBorder(70);
print("Sample vectors of school states: ");
equalsBorder(70);
print(schoolStateVectors[0:4]);
giveCounter(trainingData['project_grade_category'])
cleanedGrades = []
for grade in trainingData['project_grade_category'].values:
grade = grade.replace(' ', '');
grade = grade.replace('-', 'to');
cleanedGrades.append(grade);
cleanedGrades[0:4]
trainingData['project_grade_category'] = cleanedGrades
trainingData.head(4)
# Categorizing project grade feature using response encoding
projectGradeResponseData = [dict(), dict()];
for index, dataPoint in trainingData.iterrows():
projectGrade = dataPoint['project_grade_category'];
classValue = dataPoint['project_is_approved'];
if(projectGrade in projectGradeResponseData[classValue]):
projectGradeResponseData[classValue][projectGrade] += 1;
else:
projectGradeResponseData[classValue][projectGrade] = 1;
allProjectGrades = set(list(projectGradeResponseData[0].keys()) + list(projectGradeResponseData[1].keys()));
for projectGrade in allProjectGrades:
if(projectGrade not in projectGradeResponseData[0]):
projectGradeResponseData[0][projectGrade] = 0;
if(projectGrade not in projectGradeResponseData[1]):
projectGradeResponseData[1][projectGrade] = 0;
def projectGradeTransform(projectGrades):
transformedData = pd.DataFrame(columns = ['ProjectGrades0', 'ProjectGrades1']);
numRows = len(projectGrades);
for index, projectGrade in enumerate(tqdm(projectGrades)):
if projectGrade in allProjectGrades:
class0Value = projectGradeResponseData[0][projectGrade];
class1Value = projectGradeResponseData[1][projectGrade];
totalValue = class0Value + class1Value;
class0Value = class0Value / totalValue;
class1Value = class1Value / totalValue;
transformedData.loc[index] = [class0Value, class1Value];
else:
transformedData.loc[index] = [0.5, 0.5];
return csr_matrix(transformedData);
projectGradeVectors = projectGradeTransform(trainingData['project_grade_category'].values);
print("Features used in vectorizing project grades: ");
equalsBorder(70);
print(list(allProjectGrades));
equalsBorder(70);
print("Shape of project grades matrix after vectorization (response encoding): ", projectGradeVectors.shape);
equalsBorder(70);
print("Sample vectors of project grades: ");
equalsBorder(70);
print(projectGradeVectors[0:4]);
preProcessedEssaysWithStopWords, preProcessedEssaysWithoutStopWords = preProcessingWithAndWithoutStopWords(trainingData['project_essay']);
preProcessedProjectTitlesWithStopWords, preProcessedProjectTitlesWithoutStopWords = preProcessingWithAndWithoutStopWords(trainingData['project_title']);
bagOfWordsVectorizedFeatures = [];
# Initializing countvectorizer for bag of words vectorization of preprocessed project essays
bowEssayVectorizer = CountVectorizer(min_df = 10, max_features = 5000);
# Transforming the preprocessed essays to bag of words vectors
bowEssayModel = bowEssayVectorizer.fit_transform(preProcessedEssaysWithoutStopWords);
print("Some of the Features used in vectorizing preprocessed essays: ");
equalsBorder(70);
print(bowEssayVectorizer.get_feature_names()[-40:]);
equalsBorder(70);
print("Shape of preprocessed essay matrix after vectorization: ", bowEssayModel.shape);
equalsBorder(70);
print("Sample bag-of-words vector of preprocessed essay: ");
equalsBorder(70);
print(bowEssayModel[0])
# Initializing countvectorizer for bag of words vectorization of preprocessed project titles
bowTitleVectorizer = CountVectorizer(min_df = 10);
# Transforming the preprocessed project titles to bag of words vectors
bowTitleModel = bowTitleVectorizer.fit_transform(preProcessedProjectTitlesWithoutStopWords);
print("Some of the Features used in vectorizing preprocessed titles: ");
equalsBorder(70);
print(bowTitleVectorizer.get_feature_names()[-40:]);
equalsBorder(70);
print("Shape of preprocessed title matrix after vectorization: ", bowTitleModel.shape);
equalsBorder(70);
print("Sample bag-of-words vector of preprocessed title: ");
equalsBorder(70);
print(bowTitleModel[0])
# Intializing tfidf vectorizer for tf-idf vectorization of preprocessed project essays
tfIdfEssayVectorizer = TfidfVectorizer(min_df = 10, max_features = 5000);
# Transforming the preprocessed project essays to tf-idf vectors
tfIdfEssayModel = tfIdfEssayVectorizer.fit_transform(preProcessedEssaysWithoutStopWords);
print("Some of the Features used in tf-idf vectorizing preprocessed essays: ");
equalsBorder(70);
print(tfIdfEssayVectorizer.get_feature_names()[-40:]);
equalsBorder(70);
print("Shape of preprocessed title matrix after tf-idf vectorization: ", tfIdfEssayModel.shape);
equalsBorder(70);
print("Sample Tf-Idf vector of preprocessed essay: ");
equalsBorder(70);
print(tfIdfEssayModel[0])
# Intializing tfidf vectorizer for tf-idf vectorization of preprocessed project titles
tfIdfTitleVectorizer = TfidfVectorizer(min_df = 10);
# Transforming the preprocessed project titles to tf-idf vectors
tfIdfTitleModel = tfIdfTitleVectorizer.fit_transform(preProcessedProjectTitlesWithoutStopWords);
print("Some of the Features used in tf-idf vectorizing preprocessed titles: ");
equalsBorder(70);
print(tfIdfTitleVectorizer.get_feature_names()[-40:]);
equalsBorder(70);
print("Shape of preprocessed title matrix after tf-idf vectorization: ", tfIdfTitleModel.shape);
equalsBorder(70);
print("Sample Tf-Idf vector of preprocessed title: ");
equalsBorder(70);
print(tfIdfTitleModel[0])
# storing variables in pickle files python: http://www.jessicayung.com/how-to-use-pickle-to-save-and-load-variables-in-python/
# The precomputed glove_vectors pickle file must be present at this path to build the model below
with open('drive/My Drive/glove_vectors', 'rb') as f:
gloveModel = pickle.load(f)
gloveWords = set(gloveModel.keys())
print("Glove vector of sample word: ");
equalsBorder(70);
print(gloveModel['technology']);
equalsBorder(70);
print("Shape of glove vector: ", gloveModel['technology'].shape);
def getWord2VecVectors(texts):
word2VecTextsVectors = [];
for preProcessedText in tqdm(texts):
word2VecTextVector = np.zeros(300);
numberOfWordsInText = 0;
for word in preProcessedText.split():
if word in gloveWords:
word2VecTextVector += gloveModel[word];
numberOfWordsInText += 1;
if numberOfWordsInText != 0:
word2VecTextVector = word2VecTextVector / numberOfWordsInText;
word2VecTextsVectors.append(word2VecTextVector);
return word2VecTextsVectors;
word2VecEssaysVectors = getWord2VecVectors(preProcessedEssaysWithoutStopWords);
print("Shape of Word2Vec vectorization matrix of essays: {},{}".format(len(word2VecEssaysVectors), len(word2VecEssaysVectors[0])));
equalsBorder(70);
print("Sample essay: ");
equalsBorder(70);
print(preProcessedEssaysWithoutStopWords[0]);
equalsBorder(70);
print("Word2Vec vector of sample essay: ");
equalsBorder(70);
print(word2VecEssaysVectors[0]);
word2VecTitlesVectors = getWord2VecVectors(preProcessedProjectTitlesWithoutStopWords);
print("Shape of Word2Vec vectorization matrix of project titles: {}, {}".format(len(word2VecTitlesVectors), len(word2VecTitlesVectors[0])));
equalsBorder(70);
print("Sample title: ");
equalsBorder(70);
print(preProcessedProjectTitlesWithoutStopWords[0]);
equalsBorder(70);
print("Word2Vec vector of sample title: ");
equalsBorder(70);
print(word2VecTitlesVectors[0]);
# Initializing tfidf vectorizer
tfIdfEssayTempVectorizer = TfidfVectorizer();
# Fitting the tfidf vectorizer on the preprocessed essays (only the learned idf values are needed here)
tfIdfEssayTempVectorizer.fit(preProcessedEssaysWithoutStopWords);
# Building a dictionary mapping each word to its idf value
tfIdfEssayDictionary = dict(zip(tfIdfEssayTempVectorizer.get_feature_names(), list(tfIdfEssayTempVectorizer.idf_)));
# Creating set of all unique words used by tfidf vectorizer
tfIdfEssayWords = set(tfIdfEssayTempVectorizer.get_feature_names());
# Creating list to save tf-idf weighted vectors of essays
tfIdfWeightedWord2VecEssaysVectors = [];
# Iterating over each essay
for essay in tqdm(preProcessedEssaysWithoutStopWords):
# Sum of tf-idf values of all words in a particular essay
cumulativeSumTfIdfWeightOfEssay = 0;
# Tf-Idf weighted word2vec vector of a particular essay
tfIdfWeightedWord2VecEssayVector = np.zeros(300);
# Splitting essay into list of words
splittedEssay = essay.split();
# Iterating over each word
for word in splittedEssay:
# Checking if word is in glove words and set of words used by tfIdf essay vectorizer
if (word in gloveWords) and (word in tfIdfEssayWords):
# Tf-Idf value of particular word in essay (counting whole words in the token list, not substrings)
tfIdfValueWord = tfIdfEssayDictionary[word] * (splittedEssay.count(word) / len(splittedEssay));
# Making tf-idf weighted word2vec
tfIdfWeightedWord2VecEssayVector += tfIdfValueWord * gloveModel[word];
# Summing tf-idf weight of word to cumulative sum
cumulativeSumTfIdfWeightOfEssay += tfIdfValueWord;
if cumulativeSumTfIdfWeightOfEssay != 0:
# Taking average of sum of vectors with tf-idf cumulative sum
tfIdfWeightedWord2VecEssayVector = tfIdfWeightedWord2VecEssayVector / cumulativeSumTfIdfWeightOfEssay;
# Appending the above calculated tf-idf weighted vector of particular essay to list of vectors of essays
tfIdfWeightedWord2VecEssaysVectors.append(tfIdfWeightedWord2VecEssayVector);
print("Shape of Tf-Idf weighted Word2Vec vectorization matrix of project essays: {}, {}".format(len(tfIdfWeightedWord2VecEssaysVectors), len(tfIdfWeightedWord2VecEssaysVectors[0])));
equalsBorder(70);
print("Sample Essay: ");
equalsBorder(70);
print(preProcessedEssaysWithoutStopWords[0]);
equalsBorder(70);
print("Tf-Idf Weighted Word2Vec vector of sample essay: ");
equalsBorder(70);
print(tfIdfWeightedWord2VecEssaysVectors[0]);
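The weighting in the loop above can be sketched with toy idf values and 2-dimensional embeddings (both hypothetical): each word's embedding is scaled by that word's tf-idf weight, and the running sum is divided by the total tf-idf weight at the end:

```python
import numpy as np

# Hypothetical idf values and embeddings, for illustration only.
toy_idf = {"science": 2.0, "class": 1.0}
toy_glove = {"science": np.array([1.0, 1.0]), "class": np.array([3.0, 1.0])}

def tfidf_weighted_vector(text, dim=2):
    words = text.split()
    vec, weight_sum = np.zeros(dim), 0.0
    for word in words:
        if word in toy_glove and word in toy_idf:
            # tf-idf weight of this word in this text
            tfidf = toy_idf[word] * (words.count(word) / len(words))
            vec += tfidf * toy_glove[word]
            weight_sum += tfidf
    # Normalize by the cumulative tf-idf weight, as in the loop above
    return vec / weight_sum if weight_sum else vec

print(tfidf_weighted_vector("science class"))
```

Compared with plain averaging, rare high-idf words pull the text vector further toward their own embeddings.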
# Initializing tfidf vectorizer
tfIdfTitleTempVectorizer = TfidfVectorizer();
# Fitting the tfidf vectorizer on the preprocessed titles (only the learned idf values are needed here)
tfIdfTitleTempVectorizer.fit(preProcessedProjectTitlesWithoutStopWords);
# Building a dictionary mapping each word to its idf value
tfIdfTitleDictionary = dict(zip(tfIdfTitleTempVectorizer.get_feature_names(), list(tfIdfTitleTempVectorizer.idf_)));
# Creating set of all unique words used by tfidf vectorizer
tfIdfTitleWords = set(tfIdfTitleTempVectorizer.get_feature_names());
# Creating list to save tf-idf weighted vectors of project titles
tfIdfWeightedWord2VecTitlesVectors = [];
# Iterating over each title
for title in tqdm(preProcessedProjectTitlesWithoutStopWords):
# Sum of tf-idf values of all words in a particular project title
cumulativeSumTfIdfWeightOfTitle = 0;
# Tf-Idf weighted word2vec vector of a particular project title
tfIdfWeightedWord2VecTitleVector = np.zeros(300);
# Splitting title into list of words
splittedTitle = title.split();
# Iterating over each word
for word in splittedTitle:
# Checking if word is in glove words and set of words used by tfIdf title vectorizer
if (word in gloveWords) and (word in tfIdfTitleWords):
# Tf-Idf value of particular word in title (counting whole words in the token list, not substrings)
tfIdfValueWord = tfIdfTitleDictionary[word] * (splittedTitle.count(word) / len(splittedTitle));
# Making tf-idf weighted word2vec
tfIdfWeightedWord2VecTitleVector += tfIdfValueWord * gloveModel[word];
# Summing tf-idf weight of word to cumulative sum
cumulativeSumTfIdfWeightOfTitle += tfIdfValueWord;
if cumulativeSumTfIdfWeightOfTitle != 0:
# Taking average of sum of vectors with tf-idf cumulative sum
tfIdfWeightedWord2VecTitleVector = tfIdfWeightedWord2VecTitleVector / cumulativeSumTfIdfWeightOfTitle;
# Appending the above calculated tf-idf weighted vector of particular title to list of vectors of project titles
tfIdfWeightedWord2VecTitlesVectors.append(tfIdfWeightedWord2VecTitleVector);
print("Shape of Tf-Idf weighted Word2Vec vectorization matrix of project titles: {}, {}".format(len(tfIdfWeightedWord2VecTitlesVectors), len(tfIdfWeightedWord2VecTitlesVectors[0])));
equalsBorder(70);
print("Sample Title: ");
equalsBorder(70);
print(preProcessedProjectTitlesWithoutStopWords[0]);
equalsBorder(70);
print("Tf-Idf Weighted Word2Vec vector of sample title: ");
equalsBorder(70);
print(tfIdfWeightedWord2VecTitlesVectors[0]);
def getAvgTfIdfEssayVectors(arrayOfTexts):
# Creating list to save tf-idf weighted vectors of essays
tfIdfWeightedWord2VecEssaysVectors = [];
# Iterating over each essay
for essay in tqdm(arrayOfTexts):
# Sum of tf-idf values of all words in a particular essay
cumulativeSumTfIdfWeightOfEssay = 0;
# Tf-Idf weighted word2vec vector of a particular essay
tfIdfWeightedWord2VecEssayVector = np.zeros(300);
# Splitting essay into list of words
splittedEssay = essay.split();
# Iterating over each word
for word in splittedEssay:
# Checking if word is in glove words and set of words used by tfIdf essay vectorizer
if (word in gloveWords) and (word in tfIdfEssayWords):
# Tf-Idf value of particular word in essay (counting whole words in the token list, not substrings)
tfIdfValueWord = tfIdfEssayDictionary[word] * (splittedEssay.count(word) / len(splittedEssay));
# Making tf-idf weighted word2vec
tfIdfWeightedWord2VecEssayVector += tfIdfValueWord * gloveModel[word];
# Summing tf-idf weight of word to cumulative sum
cumulativeSumTfIdfWeightOfEssay += tfIdfValueWord;
if cumulativeSumTfIdfWeightOfEssay != 0:
# Taking average of sum of vectors with tf-idf cumulative sum
tfIdfWeightedWord2VecEssayVector = tfIdfWeightedWord2VecEssayVector / cumulativeSumTfIdfWeightOfEssay;
# Appending the above calculated tf-idf weighted vector of particular essay to list of vectors of essays
tfIdfWeightedWord2VecEssaysVectors.append(tfIdfWeightedWord2VecEssayVector);
return tfIdfWeightedWord2VecEssaysVectors;
def getAvgTfIdfTitleVectors(arrayOfTexts):
# Creating list to save tf-idf weighted vectors of project titles
tfIdfWeightedWord2VecTitlesVectors = [];
# Iterating over each title
for title in tqdm(arrayOfTexts):
# Sum of tf-idf values of all words in a particular project title
cumulativeSumTfIdfWeightOfTitle = 0;
# Tf-Idf weighted word2vec vector of a particular project title
tfIdfWeightedWord2VecTitleVector = np.zeros(300);
# Splitting title into list of words
splittedTitle = title.split();
# Iterating over each word
for word in splittedTitle:
# Checking if word is in glove words and set of words used by tfIdf title vectorizer
if (word in gloveWords) and (word in tfIdfTitleWords):
# Tf-Idf value of particular word in title (counting whole words in the token list, not substrings)
tfIdfValueWord = tfIdfTitleDictionary[word] * (splittedTitle.count(word) / len(splittedTitle));
# Making tf-idf weighted word2vec
tfIdfWeightedWord2VecTitleVector += tfIdfValueWord * gloveModel[word];
# Summing tf-idf weight of word to cumulative sum
cumulativeSumTfIdfWeightOfTitle += tfIdfValueWord;
if cumulativeSumTfIdfWeightOfTitle != 0:
# Taking average of sum of vectors with tf-idf cumulative sum
tfIdfWeightedWord2VecTitleVector = tfIdfWeightedWord2VecTitleVector / cumulativeSumTfIdfWeightOfTitle;
# Appending the above calculated tf-idf weighted vector of particular title to list of vectors of project titles
tfIdfWeightedWord2VecTitlesVectors.append(tfIdfWeightedWord2VecTitleVector);
return tfIdfWeightedWord2VecTitlesVectors;
# Scaling the price data to [0, 1] using MinMaxScaler (uses the column minimum and maximum)
priceScaler = MinMaxScaler();
priceScaler.fit(trainingData['price'].values.reshape(-1, 1));
priceStandardized = priceScaler.transform(trainingData['price'].values.reshape(-1, 1));
print("Shape of standardized matrix of prices: ", priceStandardized.shape);
equalsBorder(70);
print("Sample original prices: ");
equalsBorder(70);
print(trainingData['price'].values[0:5]);
print("Sample standardized prices: ");
equalsBorder(70);
print(priceStandardized[0:5]);
# Scaling the quantity data to [0, 1] using MinMaxScaler (uses the column minimum and maximum)
quantityScaler = MinMaxScaler();
quantityScaler.fit(trainingData['quantity'].values.reshape(-1, 1));
quantityStandardized = quantityScaler.transform(trainingData['quantity'].values.reshape(-1, 1));
print("Shape of standardized matrix of quantities: ", quantityStandardized.shape);
equalsBorder(70);
print("Sample original quantities: ");
equalsBorder(70);
print(trainingData['quantity'].values[0:5]);
print("Sample standardized quantities: ");
equalsBorder(70);
print(quantityStandardized[0:5]);
# Scaling the teacher_number_of_previously_posted_projects data to [0, 1] using MinMaxScaler (uses the column minimum and maximum)
previouslyPostedScaler = MinMaxScaler();
previouslyPostedScaler.fit(trainingData['teacher_number_of_previously_posted_projects'].values.reshape(-1, 1));
previouslyPostedStandardized = previouslyPostedScaler.transform(trainingData['teacher_number_of_previously_posted_projects'].values.reshape(-1, 1));
print("Shape of standardized matrix of teacher_number_of_previously_posted_projects: ", previouslyPostedStandardized.shape);
equalsBorder(70);
print("Sample original teacher_number_of_previously_posted_projects: ");
equalsBorder(70);
print(trainingData['teacher_number_of_previously_posted_projects'].values[0:5]);
print("Sample standardized teacher_number_of_previously_posted_projects: ");
equalsBorder(70);
print(previouslyPostedStandardized[0:5]);
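Note that despite the `Standardized` variable names, `MinMaxScaler` performs min-max scaling rather than mean/std standardization. On toy prices (hypothetical values), the transform reduces to a one-line formula:

```python
import numpy as np

# MinMaxScaler's transform: scaled = (x - min) / (max - min),
# mapping the column into the [0, 1] range. Toy values, not the real data.
prices = np.array([10.0, 20.0, 40.0])
scaled = (prices - prices.min()) / (prices.max() - prices.min())
print(scaled)  # minimum maps to 0, maximum maps to 1
```

Because the scalers are fit on the training column only, cross-validate and test values outside the training range can fall outside [0, 1].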
numberOfPoints = previouslyPostedStandardized.shape[0];
# Categorical data
categoriesVectorsSub = categoriesVector[0:numberOfPoints];
subCategoriesVectorsSub = subCategoriesVectors[0:numberOfPoints];
teacherPrefixVectorsSub = teacherPrefixVectors[0:numberOfPoints];
schoolStateVectorsSub = schoolStateVectors[0:numberOfPoints];
projectGradeVectorsSub = projectGradeVectors[0:numberOfPoints];
# Text data
bowEssayModelSub = bowEssayModel[0:numberOfPoints];
bowTitleModelSub = bowTitleModel[0:numberOfPoints];
tfIdfEssayModelSub = tfIdfEssayModel[0:numberOfPoints];
tfIdfTitleModelSub = tfIdfTitleModel[0:numberOfPoints];
# Numerical data
priceStandardizedSub = priceStandardized[0:numberOfPoints];
quantityStandardizedSub = quantityStandardized[0:numberOfPoints];
previouslyPostedStandardizedSub = previouslyPostedStandardized[0:numberOfPoints];
# Classes
classesTrainingSub = classesTraining;
randomForestsAndGbdtResultsDataFrame = pd.DataFrame(columns = ['Vectorizer', 'Model', 'Max Depth', 'N Estimators', 'AUC']);
randomForestsAndGbdtResultsDataFrame
# Cross-validation data categorical features transformation
categoriesTransformedCrossValidateData = subjectsCategoriesTransform(crossValidateData['cleaned_categories']);
subCategoriesTransformedCrossValidateData = subjectsSubCategoriesTransform(crossValidateData['cleaned_sub_categories']);
teacherPrefixTransformedCrossValidateData = teacherPrefixTransform(crossValidateData['teacher_prefix']);
schoolStateTransformedCrossValidateData = schoolStateTransform(crossValidateData['school_state']);
projectGradeTransformedCrossValidateData = projectGradeTransform(crossValidateData['project_grade_category']);
# Cross-validation data text features transformation
preProcessedEssaysTemp = preProcessingWithAndWithoutStopWords(crossValidateData['project_essay'])[1];
preProcessedTitlesTemp = preProcessingWithAndWithoutStopWords(crossValidateData['project_title'])[1];
bowEssayTransformedCrossValidateData = bowEssayVectorizer.transform(preProcessedEssaysTemp);
bowTitleTransformedCrossValidateData = bowTitleVectorizer.transform(preProcessedTitlesTemp);
tfIdfEssayTransformedCrossValidateData = tfIdfEssayVectorizer.transform(preProcessedEssaysTemp);
tfIdfTitleTransformedCrossValidateData = tfIdfTitleVectorizer.transform(preProcessedTitlesTemp);
avgWord2VecEssayTransformedCrossValidateData = getWord2VecVectors(preProcessedEssaysTemp);
avgWord2VecTitleTransformedCrossValidateData = getWord2VecVectors(preProcessedTitlesTemp);
tfIdfWeightedWord2VecEssayTransformedCrossValidateData = getAvgTfIdfEssayVectors(preProcessedEssaysTemp);
tfIdfWeightedWord2VecTitleTransformedCrossValidateData = getAvgTfIdfTitleVectors(preProcessedTitlesTemp);
# Cross-validation data numerical features transformation
priceTransformedCrossValidateData = priceScaler.transform(crossValidateData['price'].values.reshape(-1, 1));
quantityTransformedCrossValidateData = quantityScaler.transform(crossValidateData['quantity'].values.reshape(-1, 1));
previouslyPostedTransformedCrossValidateData = previouslyPostedScaler.transform(crossValidateData['teacher_number_of_previously_posted_projects'].values.reshape(-1, 1));
# Test data categorical features transformation
categoriesTransformedTestData = subjectsCategoriesTransform(testData['cleaned_categories']);
subCategoriesTransformedTestData = subjectsSubCategoriesTransform(testData['cleaned_sub_categories']);
teacherPrefixTransformedTestData = teacherPrefixTransform(testData['teacher_prefix']);
schoolStateTransformedTestData = schoolStateTransform(testData['school_state']);
projectGradeTransformedTestData = projectGradeTransform(testData['project_grade_category']);
# Test data text features transformation
preProcessedEssaysTemp = preProcessingWithAndWithoutStopWords(testData['project_essay'])[1];
preProcessedTitlesTemp = preProcessingWithAndWithoutStopWords(testData['project_title'])[1];
bowEssayTransformedTestData = bowEssayVectorizer.transform(preProcessedEssaysTemp);
bowTitleTransformedTestData = bowTitleVectorizer.transform(preProcessedTitlesTemp);
tfIdfEssayTransformedTestData = tfIdfEssayVectorizer.transform(preProcessedEssaysTemp);
tfIdfTitleTransformedTestData = tfIdfTitleVectorizer.transform(preProcessedTitlesTemp);
avgWord2VecEssayTransformedTestData = getWord2VecVectors(preProcessedEssaysTemp);
avgWord2VecTitleTransformedTestData = getWord2VecVectors(preProcessedTitlesTemp);
tfIdfWeightedWord2VecEssayTransformedTestData = getAvgTfIdfEssayVectors(preProcessedEssaysTemp);
tfIdfWeightedWord2VecTitleTransformedTestData = getAvgTfIdfTitleVectors(preProcessedTitlesTemp);
# Test data numerical features transformation
priceTransformedTestData = priceScaler.transform(testData['price'].values.reshape(-1, 1));
quantityTransformedTestData = quantityScaler.transform(testData['quantity'].values.reshape(-1, 1));
previouslyPostedTransformedTestData = previouslyPostedScaler.transform(testData['teacher_number_of_previously_posted_projects'].values.reshape(-1, 1));
def configure_plotly_browser_state():
import IPython
display(IPython.core.display.HTML('''
<script src="/static/components/requirejs/require.js"></script>
<script>
requirejs.config({
paths: {
base: '/static/base',
plotly: 'https://cdn.plot.ly/plotly-latest.min.js?noext',
},
});
</script>
'''))
techniques = ['Bag of words'];
for index, technique in enumerate(techniques):
trainingMergedData = hstack((categoriesVectorsSub,\
subCategoriesVectorsSub,\
teacherPrefixVectorsSub,\
schoolStateVectorsSub,\
projectGradeVectorsSub,\
priceStandardizedSub,\
previouslyPostedStandardizedSub));
crossValidateMergedData = hstack((categoriesTransformedCrossValidateData,\
subCategoriesTransformedCrossValidateData,\
teacherPrefixTransformedCrossValidateData,\
schoolStateTransformedCrossValidateData,\
projectGradeTransformedCrossValidateData,\
priceTransformedCrossValidateData,\
previouslyPostedTransformedCrossValidateData));
testMergedData = hstack((categoriesTransformedTestData,\
subCategoriesTransformedTestData,\
teacherPrefixTransformedTestData,\
schoolStateTransformedTestData,\
projectGradeTransformedTestData,\
priceTransformedTestData,\
previouslyPostedTransformedTestData));
if(index == 0):
trainingMergedData = hstack((trainingMergedData,\
bowTitleModelSub,\
bowEssayModelSub));
crossValidateMergedData = hstack((crossValidateMergedData,\
bowTitleTransformedCrossValidateData,\
bowEssayTransformedCrossValidateData));
testMergedData = hstack((testMergedData,\
bowTitleTransformedTestData,\
bowEssayTransformedTestData));
rfClassifier = RandomForestClassifier(class_weight="balanced", n_jobs = 4, min_samples_split = 400);
tunedParameters = {'n_estimators': [10, 20, 30, 50, 80, 120], 'max_depth': [10, 50, 100, 200, 400, 500]};
classifier = GridSearchCV(rfClassifier, tunedParameters, cv = 5, scoring = 'roc_auc');
classifier.fit(trainingMergedData, classesTrainingSub);
testScoresDataFrame = pd.DataFrame(data = np.hstack((classifier.cv_results_['param_n_estimators'].data[:, None], classifier.cv_results_['param_max_depth'].data[:, None], classifier.cv_results_['mean_test_score'][:, None], classifier.cv_results_['std_test_score'][:, None])), columns = ['n_estimators', 'max_depth', 'mts', 'stdts']);
testScoresDataFrame = testScoresDataFrame.astype(float);
crossValidateAucMeanValues = classifier.cv_results_['mean_test_score'];
crossValidateAucStdValues = classifier.cv_results_['std_test_score'];
trace1 = go.Scatter3d(x = testScoresDataFrame['n_estimators'], y = testScoresDataFrame['max_depth'], z = crossValidateAucMeanValues, name = 'Cross-Validate');
data = [trace1];
layout = go.Layout(scene = dict(
xaxis = dict(title='n_estimators'),
yaxis = dict(title='max_depth'),
zaxis = dict(title='AUC'),))
fig = go.Figure(data=data, layout=layout)
configure_plotly_browser_state()
offline.iplot(fig, filename='3d-scatter-colorscale')
optimalHypParamValue = classifier.best_params_['n_estimators'];
optimalHypParam2Value = classifier.best_params_['max_depth'];
rfClassifier = RandomForestClassifier(class_weight = 'balanced', n_estimators = optimalHypParamValue, max_depth = optimalHypParam2Value, n_jobs = 4, min_samples_split = 400);
rfClassifier.fit(trainingMergedData, classesTrainingSub);
predScoresTraining = rfClassifier.predict_proba(trainingMergedData);
fprTrain, tprTrain, thresholdTrain = roc_curve(classesTraining, predScoresTraining[:, 1]);
predScoresTest = rfClassifier.predict_proba(testMergedData);
fprTest, tprTest, thresholdTest = roc_curve(classesTest, predScoresTest[:, 1]);
predictionClassesTest = rfClassifier.predict(testMergedData);
equalsBorder(70);
plt.plot(fprTrain, tprTrain, label = "Train AUC = " + str(auc(fprTrain, tprTrain)));
plt.plot(fprTest, tprTest, label = "Test AUC = " + str(auc(fprTest, tprTest)));
plt.plot([0, 1], [0, 1], 'k-');
plt.xlabel("fpr values");
plt.ylabel("tpr values");
plt.grid();
plt.legend();
plt.show();
areaUnderRocValueTest = auc(fprTest, tprTest);
print("Results of analysis using {} vectorized text features merged with other features using random forest classifier: ".format(technique));
equalsBorder(70);
print("Optimal n_estimators Value: ", optimalHypParamValue);
equalsBorder(40);
print("Optimal max_depth Value: ", optimalHypParam2Value);
equalsBorder(40);
print("AUC value of test data: ", str(areaUnderRocValueTest));
# Predicting classes of test data projects
predictionClassesTest = rfClassifier.predict(testMergedData);
equalsBorder(40);
# Printing confusion matrix
confusionMatrix = confusion_matrix(classesTest, predictionClassesTest);
# Creating dataframe for generated confusion matrix
confusionMatrixDataFrame = pd.DataFrame(data = confusionMatrix, index = ['Actual: NO', 'Actual: YES'], columns = ['Predicted: NO', 'Predicted: YES']);
print("Confusion Matrix : ");
equalsBorder(60);
sbrn.heatmap(confusionMatrixDataFrame, annot = True, fmt = 'd', cmap="Greens");
plt.show();
print("Cross-validation plot (saved screenshot): ");
print("="*40);
display(Image(filename = "Random Forests - Bag of words.png", unconfined = True, width = '400px'));
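The AUC values reported above via `auc(fpr, tpr)` are simply the trapezoidal area under the ROC points returned by `roc_curve`. A minimal reimplementation on hypothetical (fpr, tpr) values makes the computation concrete:

```python
# AUC via the trapezoidal rule over (fpr, tpr) points, the same quantity
# sklearn's auc() computes. The points below are made-up example values.
def trapezoid_auc(fpr, tpr):
    area = 0.0
    for i in range(1, len(fpr)):
        # Trapezoid between consecutive ROC points
        area += (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
    return area

print(trapezoid_auc([0.0, 0.5, 1.0], [0.0, 0.8, 1.0]))  # → 0.65
```

The diagonal `plt.plot([0, 1], [0, 1])` reference line corresponds to a random classifier with AUC 0.5.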
techniques = ['Tf-Idf'];
for index, technique in enumerate(techniques):
trainingMergedData = hstack((categoriesVectorsSub,\
subCategoriesVectorsSub,\
teacherPrefixVectorsSub,\
schoolStateVectorsSub,\
projectGradeVectorsSub,\
priceStandardizedSub,\
previouslyPostedStandardizedSub));
crossValidateMergedData = hstack((categoriesTransformedCrossValidateData,\
subCategoriesTransformedCrossValidateData,\
teacherPrefixTransformedCrossValidateData,\
schoolStateTransformedCrossValidateData,\
projectGradeTransformedCrossValidateData,\
priceTransformedCrossValidateData,\
previouslyPostedTransformedCrossValidateData));
testMergedData = hstack((categoriesTransformedTestData,\
subCategoriesTransformedTestData,\
teacherPrefixTransformedTestData,\
schoolStateTransformedTestData,\
projectGradeTransformedTestData,\
priceTransformedTestData,\
previouslyPostedTransformedTestData));
if(index == 0):
trainingMergedData = hstack((trainingMergedData,\
tfIdfTitleModelSub,\
tfIdfEssayModelSub));
crossValidateMergedData = hstack((crossValidateMergedData,\
tfIdfTitleTransformedCrossValidateData,\
tfIdfEssayTransformedCrossValidateData));
testMergedData = hstack((testMergedData,\
tfIdfTitleTransformedTestData,\
tfIdfEssayTransformedTestData));
rfClassifier = RandomForestClassifier(class_weight="balanced", n_jobs = 4, min_samples_split = 400);
tunedParameters = {'n_estimators': [10, 20, 30, 50, 80, 120], 'max_depth': [10, 50, 100, 200, 400, 500]};
classifier = GridSearchCV(rfClassifier, tunedParameters, cv = 5, scoring = 'roc_auc');
classifier.fit(trainingMergedData, classesTrainingSub);
testScoresDataFrame = pd.DataFrame(data = np.hstack((classifier.cv_results_['param_n_estimators'].data[:, None], classifier.cv_results_['param_max_depth'].data[:, None], classifier.cv_results_['mean_test_score'][:, None], classifier.cv_results_['std_test_score'][:, None])), columns = ['n_estimators', 'max_depth', 'mts', 'stdts']);
testScoresDataFrame = testScoresDataFrame.astype(float);
crossValidateAucMeanValues = classifier.cv_results_['mean_test_score'];
crossValidateAucStdValues = classifier.cv_results_['std_test_score'];
trace1 = go.Scatter3d(x = testScoresDataFrame['n_estimators'], y = testScoresDataFrame['max_depth'], z = crossValidateAucMeanValues, name = 'Cross-Validate');
data = [trace1];
layout = go.Layout(scene = dict(
xaxis = dict(title='n_estimators'),
yaxis = dict(title='max_depth'),
zaxis = dict(title='AUC'),))
fig = go.Figure(data=data, layout=layout)
configure_plotly_browser_state()
offline.iplot(fig, filename='3d-scatter-colorscale')
optimalHypParamValue = classifier.best_params_['n_estimators'];
optimalHypParam2Value = classifier.best_params_['max_depth'];
rfClassifier = RandomForestClassifier(class_weight = 'balanced', n_estimators = optimalHypParamValue, max_depth = optimalHypParam2Value, n_jobs = 4, min_samples_split = 400);
rfClassifier.fit(trainingMergedData, classesTrainingSub);
predScoresTraining = rfClassifier.predict_proba(trainingMergedData);
fprTrain, tprTrain, thresholdTrain = roc_curve(classesTraining, predScoresTraining[:, 1]);
predScoresTest = rfClassifier.predict_proba(testMergedData);
fprTest, tprTest, thresholdTest = roc_curve(classesTest, predScoresTest[:, 1]);
predictionClassesTest = rfClassifier.predict(testMergedData);
equalsBorder(70);
plt.plot(fprTrain, tprTrain, label = "Train AUC = " + str(auc(fprTrain, tprTrain)));
plt.plot(fprTest, tprTest, label = "Test AUC = " + str(auc(fprTest, tprTest)));
plt.plot([0, 1], [0, 1], 'k-');
plt.xlabel("fpr values");
plt.ylabel("tpr values");
plt.grid();
plt.legend();
plt.show();
areaUnderRocValueTest = auc(fprTest, tprTest);
print("Results of analysis using {} vectorized text features merged with other features using random forest classifier: ".format(technique));
equalsBorder(70);
print("Optimal n_estimators Value: ", optimalHypParamValue);
equalsBorder(40);
print("Optimal max_depth Value: ", optimalHypParam2Value);
equalsBorder(40);
print("AUC value of test data: ", str(areaUnderRocValueTest));
# Predicting classes of test data projects
predictionClassesTest = rfClassifier.predict(testMergedData);
equalsBorder(40);
# Printing confusion matrix
confusionMatrix = confusion_matrix(classesTest, predictionClassesTest);
# Creating dataframe for generated confusion matrix
confusionMatrixDataFrame = pd.DataFrame(data = confusionMatrix, index = ['Actual: NO', 'Actual: YES'], columns = ['Predicted: NO', 'Predicted: YES']);
print("Confusion Matrix : ");
equalsBorder(60);
sbrn.heatmap(confusionMatrixDataFrame, annot = True, fmt = 'd', cmap="Greens");
plt.show();
print("Cross-validation plot (saved screenshot): ");
print("="*40);
display(Image(filename = "Random Forests - Tf-Idf.png", unconfined = True, width = '400px'));
techniques = ['Average Word2Vec'];
for index, technique in enumerate(techniques):
trainingMergedData = hstack((categoriesVectorsSub,\
subCategoriesVectorsSub,\
teacherPrefixVectorsSub,\
schoolStateVectorsSub,\
projectGradeVectorsSub,\
priceStandardizedSub,\
previouslyPostedStandardizedSub));
crossValidateMergedData = hstack((categoriesTransformedCrossValidateData,\
subCategoriesTransformedCrossValidateData,\
teacherPrefixTransformedCrossValidateData,\
schoolStateTransformedCrossValidateData,\
projectGradeTransformedCrossValidateData,\
priceTransformedCrossValidateData,\
previouslyPostedTransformedCrossValidateData));
testMergedData = hstack((categoriesTransformedTestData,\
subCategoriesTransformedTestData,\
teacherPrefixTransformedTestData,\
schoolStateTransformedTestData,\
projectGradeTransformedTestData,\
priceTransformedTestData,\
previouslyPostedTransformedTestData));
if(index == 0):
trainingMergedData = hstack((trainingMergedData,\
word2VecTitlesVectors,\
word2VecEssaysVectors));
crossValidateMergedData = hstack((crossValidateMergedData,\
avgWord2VecTitleTransformedCrossValidateData,\
avgWord2VecEssayTransformedCrossValidateData));
testMergedData = hstack((testMergedData,\
avgWord2VecTitleTransformedTestData,\
avgWord2VecEssayTransformedTestData));
rfClassifier = RandomForestClassifier(class_weight="balanced", n_jobs = 4, min_samples_split = 400);
tunedParameters = {'n_estimators': [10, 20, 30, 50, 80, 120], 'max_depth': [10, 50, 100, 200]};
classifier = GridSearchCV(rfClassifier, tunedParameters, cv = 5, scoring = 'roc_auc');
classifier.fit(trainingMergedData, classesTrainingSub);
testScoresDataFrame = pd.DataFrame(data = np.hstack((classifier.cv_results_['param_n_estimators'].data[:, None], classifier.cv_results_['param_max_depth'].data[:, None], classifier.cv_results_['mean_test_score'][:, None], classifier.cv_results_['std_test_score'][:, None])), columns = ['n_estimators', 'max_depth', 'mts', 'stdts']);
testScoresDataFrame = testScoresDataFrame.astype(float);
crossValidateAucMeanValues = classifier.cv_results_['mean_test_score'];
crossValidateAucStdValues = classifier.cv_results_['std_test_score'];
trace1 = go.Scatter3d(x = testScoresDataFrame['n_estimators'], y = testScoresDataFrame['max_depth'], z = crossValidateAucMeanValues, name = 'Cross-Validate');
data = [trace1];
layout = go.Layout(scene = dict(
xaxis = dict(title='n_estimators'),
yaxis = dict(title='max_depth'),
zaxis = dict(title='AUC'),))
fig = go.Figure(data=data, layout=layout)
configure_plotly_browser_state()
offline.iplot(fig, filename='3d-scatter-colorscale')
optimalHypParamValue = classifier.best_params_['n_estimators'];
optimalHypParam2Value = classifier.best_params_['max_depth'];
rfClassifier = RandomForestClassifier(class_weight = 'balanced', n_estimators = optimalHypParamValue, max_depth = optimalHypParam2Value, n_jobs = 4, min_samples_split = 400);
rfClassifier.fit(trainingMergedData, classesTrainingSub);
predScoresTraining = rfClassifier.predict_proba(trainingMergedData);
fprTrain, tprTrain, thresholdTrain = roc_curve(classesTraining, predScoresTraining[:, 1]);
predScoresTest = rfClassifier.predict_proba(testMergedData);
fprTest, tprTest, thresholdTest = roc_curve(classesTest, predScoresTest[:, 1]);
predictionClassesTest = rfClassifier.predict(testMergedData);
equalsBorder(70);
plt.plot(fprTrain, tprTrain, label = "Train AUC = " + str(auc(fprTrain, tprTrain)));
plt.plot(fprTest, tprTest, label = "Test AUC = " + str(auc(fprTest, tprTest)));
plt.plot([0, 1], [0, 1], 'k-');
plt.xlabel("fpr values");
plt.ylabel("tpr values");
plt.grid();
plt.legend();
plt.show();
areaUnderRocValueTest = auc(fprTest, tprTest);
print("Results of analysis using {} vectorized text features merged with other features using random forest classifier: ".format(technique));
equalsBorder(70);
print("Optimal n_estimators Value: ", optimalHypParamValue);
equalsBorder(40);
print("Optimal max_depth Value: ", optimalHypParam2Value);
equalsBorder(40);
print("AUC value of test data: ", str(areaUnderRocValueTest));
# Predicting classes of test data projects
predictionClassesTest = rfClassifier.predict(testMergedData);
equalsBorder(40);
# Printing confusion matrix
confusionMatrix = confusion_matrix(classesTest, predictionClassesTest);
# Creating dataframe for generated confusion matrix
confusionMatrixDataFrame = pd.DataFrame(data = confusionMatrix, index = ['Actual: NO', 'Actual: YES'], columns = ['Predicted: NO', 'Predicted: YES']);
print("Confusion Matrix : ");
equalsBorder(60);
sbrn.heatmap(confusionMatrixDataFrame, annot = True, fmt = 'd', cmap="Greens");
plt.show();
print("Cross-validation plot (saved screenshot): ");
print("="*40);
display(Image(filename = "Random Forests - Average word2Vec.png", unconfined = True, width = '400px'));
techniques = ['Tf-Idf Weighted Word2Vec'];
for index, technique in enumerate(techniques):
trainingMergedData = hstack((categoriesVectorsSub,\
subCategoriesVectorsSub,\
teacherPrefixVectorsSub,\
schoolStateVectorsSub,\
projectGradeVectorsSub,\
priceStandardizedSub,\
previouslyPostedStandardizedSub));
crossValidateMergedData = hstack((categoriesTransformedCrossValidateData,\
subCategoriesTransformedCrossValidateData,\
teacherPrefixTransformedCrossValidateData,\
schoolStateTransformedCrossValidateData,\
projectGradeTransformedCrossValidateData,\
priceTransformedCrossValidateData,\
previouslyPostedTransformedCrossValidateData));
testMergedData = hstack((categoriesTransformedTestData,\
subCategoriesTransformedTestData,\
teacherPrefixTransformedTestData,\
schoolStateTransformedTestData,\
projectGradeTransformedTestData,\
priceTransformedTestData,\
previouslyPostedTransformedTestData));
if(index == 0):
trainingMergedData = hstack((trainingMergedData,\
tfIdfWeightedWord2VecTitlesVectors,\
tfIdfWeightedWord2VecEssaysVectors));
crossValidateMergedData = hstack((crossValidateMergedData,\
tfIdfWeightedWord2VecTitleTransformedCrossValidateData,\
tfIdfWeightedWord2VecEssayTransformedCrossValidateData));
testMergedData = hstack((testMergedData,\
tfIdfWeightedWord2VecTitleTransformedTestData,\
tfIdfWeightedWord2VecEssayTransformedTestData));
rfClassifier = RandomForestClassifier(class_weight="balanced", n_jobs = 4, min_samples_split = 400);
tunedParameters = {'n_estimators': [10, 20, 30, 50, 80, 120], 'max_depth': [10, 50, 100, 200, 400, 500]};
classifier = GridSearchCV(rfClassifier, tunedParameters, cv = 5, scoring = 'roc_auc');
classifier.fit(trainingMergedData, classesTrainingSub);
testScoresDataFrame = pd.DataFrame(data = np.hstack((classifier.cv_results_['param_n_estimators'].data[:, None], classifier.cv_results_['param_max_depth'].data[:, None], classifier.cv_results_['mean_test_score'][:, None], classifier.cv_results_['std_test_score'][:, None])), columns = ['n_estimators', 'max_depth', 'mts', 'stdts']);
testScoresDataFrame = testScoresDataFrame.astype(float);
crossValidateAucMeanValues = classifier.cv_results_['mean_test_score'];
crossValidateAucStdValues = classifier.cv_results_['std_test_score'];
trace1 = go.Scatter3d(x = testScoresDataFrame['n_estimators'], y = testScoresDataFrame['max_depth'], z = testScoresDataFrame['mts'], mode = 'markers', name = 'Cross-Validate');
data = [trace1];
layout = go.Layout(scene = dict(
xaxis = dict(title='n_estimators'),
yaxis = dict(title='max_depth'),
zaxis = dict(title='AUC'),))
fig = go.Figure(data=data, layout=layout)
configure_plotly_browser_state()
offline.iplot(fig, filename='3d-scatter-colorscale')
optimalHypParamValue = classifier.best_params_['n_estimators'];
optimalHypParam2Value = classifier.best_params_['max_depth'];
rfClassifier = RandomForestClassifier(class_weight = 'balanced', n_estimators = optimalHypParamValue, max_depth = optimalHypParam2Value, n_jobs = 4, min_samples_split = 400);
rfClassifier.fit(trainingMergedData, classesTrainingSub);
predScoresTraining = rfClassifier.predict_proba(trainingMergedData);
fprTrain, tprTrain, thresholdTrain = roc_curve(classesTrainingSub, predScoresTraining[:, 1]);
predScoresTest = rfClassifier.predict_proba(testMergedData);
fprTest, tprTest, thresholdTest = roc_curve(classesTest, predScoresTest[:, 1]);
equalsBorder(70);
plt.plot(fprTrain, tprTrain, label = "Train AUC = " + str(auc(fprTrain, tprTrain)));
plt.plot(fprTest, tprTest, label = "Test AUC = " + str(auc(fprTest, tprTest)));
plt.plot([0, 1], [0, 1], 'k-');
plt.xlabel("False Positive Rate");
plt.ylabel("True Positive Rate");
plt.grid();
plt.legend();
plt.show();
areaUnderRocValueTest = auc(fprTest, tprTest);
print("Results of analysis using {} vectorized text features merged with other features using random forest classifier: ".format(technique));
equalsBorder(70);
print("Optimal n_estimators Value: ", optimalHypParamValue);
equalsBorder(40);
print("Optimal max_depth Value: ", optimalHypParam2Value);
equalsBorder(40);
print("AUC value of test data: ", str(areaUnderRocValueTest));
# Predicting classes of test data projects
predictionClassesTest = rfClassifier.predict(testMergedData);
equalsBorder(40);
# Printing confusion matrix
confusionMatrix = confusion_matrix(classesTest, predictionClassesTest);
# Creating dataframe for generated confusion matrix
confusionMatrixDataFrame = pd.DataFrame(data = confusionMatrix, index = ['Actual: NO', 'Actual: YES'], columns = ['Predicted: NO', 'Predicted: YES']);
print("Confusion Matrix : ");
equalsBorder(60);
sbrn.heatmap(confusionMatrixDataFrame, annot = True, fmt = 'd', cmap="Greens");
plt.show();
print("Cross-validation curve: ");
print("="*40);
display(Image(filename = "Random Forests - Tf-Idf Weighted Word2Vec.png", unconfined = True, width = '400px'));
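The 3D scatter above is fed by the per-combination arrays in `GridSearchCV.cv_results_`. A minimal, self-contained sketch of that extraction step (synthetic data and a tiny grid, both hypothetical stand-ins for the merged feature matrix):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the merged feature matrix (hypothetical data).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

grid = {'n_estimators': [5, 10], 'max_depth': [2, 4]}
search = GridSearchCV(RandomForestClassifier(random_state=0), grid,
                      cv=3, scoring='roc_auc')
search.fit(X, y)

# One row per (n_estimators, max_depth) combination; 'mts' is the mean
# cross-validated AUC, i.e. exactly what the Scatter3d z-axis plots.
scores = pd.DataFrame({
    'n_estimators': search.cv_results_['param_n_estimators'].data,
    'max_depth': search.cv_results_['param_max_depth'].data,
    'mts': search.cv_results_['mean_test_score'],
    'stdts': search.cv_results_['std_test_score'],
}).astype(float)
```

Because `cv_results_` already enumerates every grid combination, these columns can be passed straight to `go.Scatter3d` without reconstructing the grid by hand.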
techniques = ['Bag of words'];
for index, technique in enumerate(techniques):
trainingMergedData = hstack((categoriesVectorsSub,\
subCategoriesVectorsSub,\
teacherPrefixVectorsSub,\
schoolStateVectorsSub,\
projectGradeVectorsSub,\
priceStandardizedSub,\
previouslyPostedStandardizedSub));
crossValidateMergedData = hstack((categoriesTransformedCrossValidateData,\
subCategoriesTransformedCrossValidateData,\
teacherPrefixTransformedCrossValidateData,\
schoolStateTransformedCrossValidateData,\
projectGradeTransformedCrossValidateData,\
priceTransformedCrossValidateData,\
previouslyPostedTransformedCrossValidateData));
testMergedData = hstack((categoriesTransformedTestData,\
subCategoriesTransformedTestData,\
teacherPrefixTransformedTestData,\
schoolStateTransformedTestData,\
projectGradeTransformedTestData,\
priceTransformedTestData,\
previouslyPostedTransformedTestData));
if index == 0:
trainingMergedData = hstack((trainingMergedData,\
bowTitleModelSub,\
bowEssayModelSub));
crossValidateMergedData = hstack((crossValidateMergedData,\
bowTitleTransformedCrossValidateData,\
bowEssayTransformedCrossValidateData));
testMergedData = hstack((testMergedData,\
bowTitleTransformedTestData,\
bowEssayTransformedTestData));
gbdtClassifier = xgb.XGBClassifier(n_jobs = 6, reg_alpha = 1, reg_lambda = 0, subsample = 0.5, colsample_bytree = 0.5);
tunedParameters = {'n_estimators': [10, 20, 50, 120], 'max_depth': [2, 4, 6, 10, 15]};
classifier = GridSearchCV(gbdtClassifier, tunedParameters, cv = 5, scoring = 'roc_auc');
classifier.fit(trainingMergedData, classesTrainingSub);
testScoresDataFrame = pd.DataFrame(data = np.hstack((classifier.cv_results_['param_n_estimators'].data[:, None], classifier.cv_results_['param_max_depth'].data[:, None], classifier.cv_results_['mean_test_score'][:, None], classifier.cv_results_['std_test_score'][:, None])), columns = ['n_estimators', 'max_depth', 'mts', 'stdts']);
testScoresDataFrame = testScoresDataFrame.astype(float);
crossValidateAucMeanValues = classifier.cv_results_['mean_test_score'];
crossValidateAucStdValues = classifier.cv_results_['std_test_score'];
trace1 = go.Scatter3d(x = testScoresDataFrame['n_estimators'], y = testScoresDataFrame['max_depth'], z = testScoresDataFrame['mts'], mode = 'markers', name = 'Cross-Validate');
data = [trace1];
layout = go.Layout(scene = dict(
xaxis = dict(title='n_estimators'),
yaxis = dict(title='max_depth'),
zaxis = dict(title='AUC'),))
fig = go.Figure(data=data, layout=layout)
configure_plotly_browser_state()
offline.iplot(fig, filename='3d-scatter-colorscale')
optimalHypParamValue = classifier.best_params_['n_estimators'];
optimalHypParam2Value = classifier.best_params_['max_depth'];
gbdtClassifier = xgb.XGBClassifier(n_estimators = optimalHypParamValue, max_depth = optimalHypParam2Value, n_jobs = 6, reg_alpha = 1, reg_lambda = 0, subsample = 0.5, colsample_bytree = 0.5);
gbdtClassifier.fit(trainingMergedData, classesTrainingSub);
predScoresTraining = gbdtClassifier.predict_proba(trainingMergedData);
fprTrain, tprTrain, thresholdTrain = roc_curve(classesTrainingSub, predScoresTraining[:, 1]);
predScoresTest = gbdtClassifier.predict_proba(testMergedData);
fprTest, tprTest, thresholdTest = roc_curve(classesTest, predScoresTest[:, 1]);
equalsBorder(70);
plt.plot(fprTrain, tprTrain, label = "Train AUC = " + str(auc(fprTrain, tprTrain)));
plt.plot(fprTest, tprTest, label = "Test AUC = " + str(auc(fprTest, tprTest)));
plt.plot([0, 1], [0, 1], 'k-');
plt.xlabel("False Positive Rate");
plt.ylabel("True Positive Rate");
plt.grid();
plt.legend();
plt.show();
areaUnderRocValueTest = auc(fprTest, tprTest);
print("Results of analysis using {} vectorized text features merged with other features using gradient boosting classifier: ".format(technique));
equalsBorder(70);
print("Optimal n_estimators Value: ", optimalHypParamValue);
equalsBorder(40);
print("Optimal max_depth Value: ", optimalHypParam2Value);
equalsBorder(40);
print("AUC value of test data: ", str(areaUnderRocValueTest));
# Predicting classes of test data projects
predictionClassesTest = gbdtClassifier.predict(testMergedData);
equalsBorder(40);
# Printing confusion matrix
confusionMatrix = confusion_matrix(classesTest, predictionClassesTest);
# Creating dataframe for generated confusion matrix
confusionMatrixDataFrame = pd.DataFrame(data = confusionMatrix, index = ['Actual: NO', 'Actual: YES'], columns = ['Predicted: NO', 'Predicted: YES']);
print("Confusion Matrix : ");
equalsBorder(60);
sbrn.heatmap(confusionMatrixDataFrame, annot = True, fmt = 'd', cmap="Greens");
plt.show();
print("Cross-validation curve: ");
print("="*40);
display(Image(filename = "GBDT - Bag of words.png", unconfined = True, width = '400px'));
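The confusion-matrix-to-DataFrame step repeated in each block can be checked in isolation. A small sketch with hypothetical labels (0 = rejected, 1 = approved), using the same row/column naming as above:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix

# Hypothetical ground-truth and predicted labels for eight projects.
y_true = np.array([1, 1, 0, 1, 0, 1, 0, 1])
y_pred = np.array([1, 0, 0, 1, 1, 1, 0, 1])

# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
cm_df = pd.DataFrame(cm,
                     index=['Actual: NO', 'Actual: YES'],
                     columns=['Predicted: NO', 'Predicted: YES'])
```

The resulting frame is what `sbrn.heatmap(..., annot=True, fmt='d')` renders; labelling the axes this way makes the TN/FP/FN/TP layout explicit in the plot.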
techniques = ['Tf-Idf'];
for index, technique in enumerate(techniques):
trainingMergedData = hstack((categoriesVectorsSub,\
subCategoriesVectorsSub,\
teacherPrefixVectorsSub,\
schoolStateVectorsSub,\
projectGradeVectorsSub,\
priceStandardizedSub,\
previouslyPostedStandardizedSub));
crossValidateMergedData = hstack((categoriesTransformedCrossValidateData,\
subCategoriesTransformedCrossValidateData,\
teacherPrefixTransformedCrossValidateData,\
schoolStateTransformedCrossValidateData,\
projectGradeTransformedCrossValidateData,\
priceTransformedCrossValidateData,\
previouslyPostedTransformedCrossValidateData));
testMergedData = hstack((categoriesTransformedTestData,\
subCategoriesTransformedTestData,\
teacherPrefixTransformedTestData,\
schoolStateTransformedTestData,\
projectGradeTransformedTestData,\
priceTransformedTestData,\
previouslyPostedTransformedTestData));
if index == 0:
trainingMergedData = hstack((trainingMergedData,\
tfIdfTitleModelSub,\
tfIdfEssayModelSub));
crossValidateMergedData = hstack((crossValidateMergedData,\
tfIdfTitleTransformedCrossValidateData,\
tfIdfEssayTransformedCrossValidateData));
testMergedData = hstack((testMergedData,\
tfIdfTitleTransformedTestData,\
tfIdfEssayTransformedTestData));
gbdtClassifier = xgb.XGBClassifier(n_jobs = 6, reg_alpha = 1, reg_lambda = 0, subsample = 0.5, colsample_bytree = 0.5);
tunedParameters = {'n_estimators': [10, 20, 50, 120], 'max_depth': [2, 4, 6, 10, 15]};
classifier = GridSearchCV(gbdtClassifier, tunedParameters, cv = 5, scoring = 'roc_auc');
classifier.fit(trainingMergedData, classesTrainingSub);
testScoresDataFrame = pd.DataFrame(data = np.hstack((classifier.cv_results_['param_n_estimators'].data[:, None], classifier.cv_results_['param_max_depth'].data[:, None], classifier.cv_results_['mean_test_score'][:, None], classifier.cv_results_['std_test_score'][:, None])), columns = ['n_estimators', 'max_depth', 'mts', 'stdts']);
testScoresDataFrame = testScoresDataFrame.astype(float);
crossValidateAucMeanValues = classifier.cv_results_['mean_test_score'];
crossValidateAucStdValues = classifier.cv_results_['std_test_score'];
trace1 = go.Scatter3d(x = testScoresDataFrame['n_estimators'], y = testScoresDataFrame['max_depth'], z = testScoresDataFrame['mts'], mode = 'markers', name = 'Cross-Validate');
data = [trace1];
layout = go.Layout(scene = dict(
xaxis = dict(title='n_estimators'),
yaxis = dict(title='max_depth'),
zaxis = dict(title='AUC'),))
fig = go.Figure(data=data, layout=layout)
configure_plotly_browser_state()
offline.iplot(fig, filename='3d-scatter-colorscale')
optimalHypParamValue = classifier.best_params_['n_estimators'];
optimalHypParam2Value = classifier.best_params_['max_depth'];
gbdtClassifier = xgb.XGBClassifier(n_estimators = optimalHypParamValue, max_depth = optimalHypParam2Value, n_jobs = 6, reg_alpha = 1, reg_lambda = 0, subsample = 0.5, colsample_bytree = 0.5);
gbdtClassifier.fit(trainingMergedData, classesTrainingSub);
predScoresTraining = gbdtClassifier.predict_proba(trainingMergedData);
fprTrain, tprTrain, thresholdTrain = roc_curve(classesTrainingSub, predScoresTraining[:, 1]);
predScoresTest = gbdtClassifier.predict_proba(testMergedData);
fprTest, tprTest, thresholdTest = roc_curve(classesTest, predScoresTest[:, 1]);
equalsBorder(70);
plt.plot(fprTrain, tprTrain, label = "Train AUC = " + str(auc(fprTrain, tprTrain)));
plt.plot(fprTest, tprTest, label = "Test AUC = " + str(auc(fprTest, tprTest)));
plt.plot([0, 1], [0, 1], 'k-');
plt.xlabel("False Positive Rate");
plt.ylabel("True Positive Rate");
plt.grid();
plt.legend();
plt.show();
areaUnderRocValueTest = auc(fprTest, tprTest);
print("Results of analysis using {} vectorized text features merged with other features using gradient boosting classifier: ".format(technique));
equalsBorder(70);
print("Optimal n_estimators Value: ", optimalHypParamValue);
equalsBorder(40);
print("Optimal max_depth Value: ", optimalHypParam2Value);
equalsBorder(40);
print("AUC value of test data: ", str(areaUnderRocValueTest));
# Predicting classes of test data projects
predictionClassesTest = gbdtClassifier.predict(testMergedData);
equalsBorder(40);
# Printing confusion matrix
confusionMatrix = confusion_matrix(classesTest, predictionClassesTest);
# Creating dataframe for generated confusion matrix
confusionMatrixDataFrame = pd.DataFrame(data = confusionMatrix, index = ['Actual: NO', 'Actual: YES'], columns = ['Predicted: NO', 'Predicted: YES']);
print("Confusion Matrix : ");
equalsBorder(60);
sbrn.heatmap(confusionMatrixDataFrame, annot = True, fmt = 'd', cmap="Greens");
plt.show();
print("Cross-validation curve: ");
print("="*40);
display(Image(filename = "GBDT - Tf-Idf.png", unconfined = True, width = '400px'));
techniques = ['Average Word2Vec'];
for index, technique in enumerate(techniques):
trainingMergedData = hstack((categoriesVectorsSub,\
subCategoriesVectorsSub,\
teacherPrefixVectorsSub,\
schoolStateVectorsSub,\
projectGradeVectorsSub,\
priceStandardizedSub,\
previouslyPostedStandardizedSub));
crossValidateMergedData = hstack((categoriesTransformedCrossValidateData,\
subCategoriesTransformedCrossValidateData,\
teacherPrefixTransformedCrossValidateData,\
schoolStateTransformedCrossValidateData,\
projectGradeTransformedCrossValidateData,\
priceTransformedCrossValidateData,\
previouslyPostedTransformedCrossValidateData));
testMergedData = hstack((categoriesTransformedTestData,\
subCategoriesTransformedTestData,\
teacherPrefixTransformedTestData,\
schoolStateTransformedTestData,\
projectGradeTransformedTestData,\
priceTransformedTestData,\
previouslyPostedTransformedTestData));
if index == 0:
trainingMergedData = hstack((trainingMergedData,\
word2VecTitlesVectors,\
word2VecEssaysVectors));
crossValidateMergedData = hstack((crossValidateMergedData,\
avgWord2VecTitleTransformedCrossValidateData,\
avgWord2VecEssayTransformedCrossValidateData));
testMergedData = hstack((testMergedData,\
avgWord2VecTitleTransformedTestData,\
avgWord2VecEssayTransformedTestData));
gbdtClassifier = xgb.XGBClassifier(n_jobs = 6, reg_alpha = 1, reg_lambda = 0, subsample = 0.5, colsample_bytree = 0.5);
tunedParameters = {'n_estimators': [10, 20, 80, 130], 'max_depth': [2, 4, 6, 10, 15]};
classifier = GridSearchCV(gbdtClassifier, tunedParameters, cv = 5, scoring = 'roc_auc');
classifier.fit(trainingMergedData, classesTrainingSub);
testScoresDataFrame = pd.DataFrame(data = np.hstack((classifier.cv_results_['param_n_estimators'].data[:, None], classifier.cv_results_['param_max_depth'].data[:, None], classifier.cv_results_['mean_test_score'][:, None], classifier.cv_results_['std_test_score'][:, None])), columns = ['n_estimators', 'max_depth', 'mts', 'stdts']);
testScoresDataFrame = testScoresDataFrame.astype(float);
crossValidateAucMeanValues = classifier.cv_results_['mean_test_score'];
crossValidateAucStdValues = classifier.cv_results_['std_test_score'];
trace1 = go.Scatter3d(x = testScoresDataFrame['n_estimators'], y = testScoresDataFrame['max_depth'], z = testScoresDataFrame['mts'], mode = 'markers', name = 'Cross-Validate');
data = [trace1];
layout = go.Layout(scene = dict(
xaxis = dict(title='n_estimators'),
yaxis = dict(title='max_depth'),
zaxis = dict(title='AUC'),))
fig = go.Figure(data=data, layout=layout)
configure_plotly_browser_state()
offline.iplot(fig, filename='3d-scatter-colorscale')
optimalHypParamValue = classifier.best_params_['n_estimators'];
optimalHypParam2Value = classifier.best_params_['max_depth'];
gbdtClassifier = xgb.XGBClassifier(n_estimators = optimalHypParamValue, max_depth = optimalHypParam2Value, n_jobs = 6, reg_alpha = 1, reg_lambda = 0, subsample = 0.5, colsample_bytree = 0.5);
gbdtClassifier.fit(trainingMergedData, classesTrainingSub);
predScoresTraining = gbdtClassifier.predict_proba(trainingMergedData);
fprTrain, tprTrain, thresholdTrain = roc_curve(classesTrainingSub, predScoresTraining[:, 1]);
predScoresTest = gbdtClassifier.predict_proba(testMergedData);
fprTest, tprTest, thresholdTest = roc_curve(classesTest, predScoresTest[:, 1]);
equalsBorder(70);
plt.plot(fprTrain, tprTrain, label = "Train AUC = " + str(auc(fprTrain, tprTrain)));
plt.plot(fprTest, tprTest, label = "Test AUC = " + str(auc(fprTest, tprTest)));
plt.plot([0, 1], [0, 1], 'k-');
plt.xlabel("False Positive Rate");
plt.ylabel("True Positive Rate");
plt.grid();
plt.legend();
plt.show();
areaUnderRocValueTest = auc(fprTest, tprTest);
print("Results of analysis using {} vectorized text features merged with other features using gradient boosting classifier: ".format(technique));
equalsBorder(70);
print("Optimal n_estimators Value: ", optimalHypParamValue);
equalsBorder(40);
print("Optimal max_depth Value: ", optimalHypParam2Value);
equalsBorder(40);
print("AUC value of test data: ", str(areaUnderRocValueTest));
# Predicting classes of test data projects
predictionClassesTest = gbdtClassifier.predict(testMergedData);
equalsBorder(40);
# Printing confusion matrix
confusionMatrix = confusion_matrix(classesTest, predictionClassesTest);
# Creating dataframe for generated confusion matrix
confusionMatrixDataFrame = pd.DataFrame(data = confusionMatrix, index = ['Actual: NO', 'Actual: YES'], columns = ['Predicted: NO', 'Predicted: YES']);
print("Confusion Matrix : ");
equalsBorder(60);
sbrn.heatmap(confusionMatrixDataFrame, annot = True, fmt = 'd', cmap="Greens");
plt.show();
print("Cross-validation curve: ");
print("="*40);
display(Image(filename = "GBDT - Average Word2Vec.png", unconfined = True, width = '400px'));
techniques = ['Tf-Idf Weighted Word2Vec'];
for index, technique in enumerate(techniques):
trainingMergedData = hstack((categoriesVectorsSub,\
subCategoriesVectorsSub,\
teacherPrefixVectorsSub,\
schoolStateVectorsSub,\
projectGradeVectorsSub,\
priceStandardizedSub,\
previouslyPostedStandardizedSub));
crossValidateMergedData = hstack((categoriesTransformedCrossValidateData,\
subCategoriesTransformedCrossValidateData,\
teacherPrefixTransformedCrossValidateData,\
schoolStateTransformedCrossValidateData,\
projectGradeTransformedCrossValidateData,\
priceTransformedCrossValidateData,\
previouslyPostedTransformedCrossValidateData));
testMergedData = hstack((categoriesTransformedTestData,\
subCategoriesTransformedTestData,\
teacherPrefixTransformedTestData,\
schoolStateTransformedTestData,\
projectGradeTransformedTestData,\
priceTransformedTestData,\
previouslyPostedTransformedTestData));
if index == 0:
trainingMergedData = hstack((trainingMergedData,\
tfIdfWeightedWord2VecTitlesVectors,\
tfIdfWeightedWord2VecEssaysVectors));
crossValidateMergedData = hstack((crossValidateMergedData,\
tfIdfWeightedWord2VecTitleTransformedCrossValidateData,\
tfIdfWeightedWord2VecEssayTransformedCrossValidateData));
testMergedData = hstack((testMergedData,\
tfIdfWeightedWord2VecTitleTransformedTestData,\
tfIdfWeightedWord2VecEssayTransformedTestData));
gbdtClassifier = xgb.XGBClassifier(n_jobs = 6, reg_alpha = 1, reg_lambda = 0, subsample = 0.5, colsample_bytree = 0.5);
tunedParameters = {'n_estimators': [80, 100, 130], 'max_depth': [2, 4, 6, 10, 15]};
classifier = GridSearchCV(gbdtClassifier, tunedParameters, cv = 5, scoring = 'roc_auc');
classifier.fit(trainingMergedData, classesTrainingSub);
testScoresDataFrame = pd.DataFrame(data = np.hstack((classifier.cv_results_['param_n_estimators'].data[:, None], classifier.cv_results_['param_max_depth'].data[:, None], classifier.cv_results_['mean_test_score'][:, None], classifier.cv_results_['std_test_score'][:, None])), columns = ['n_estimators', 'max_depth', 'mts', 'stdts']);
testScoresDataFrame = testScoresDataFrame.astype(float);
crossValidateAucMeanValues = classifier.cv_results_['mean_test_score'];
crossValidateAucStdValues = classifier.cv_results_['std_test_score'];
trace1 = go.Scatter3d(x = testScoresDataFrame['n_estimators'], y = testScoresDataFrame['max_depth'], z = testScoresDataFrame['mts'], mode = 'markers', name = 'Cross-Validate');
data = [trace1];
layout = go.Layout(scene = dict(
xaxis = dict(title='n_estimators'),
yaxis = dict(title='max_depth'),
zaxis = dict(title='AUC'),))
fig = go.Figure(data=data, layout=layout)
configure_plotly_browser_state()
offline.iplot(fig, filename='3d-scatter-colorscale')
optimalHypParamValue = classifier.best_params_['n_estimators'];
optimalHypParam2Value = classifier.best_params_['max_depth'];
gbdtClassifier = xgb.XGBClassifier(n_estimators = optimalHypParamValue, max_depth = optimalHypParam2Value, n_jobs = 6, reg_alpha = 1, reg_lambda = 0, subsample = 0.5, colsample_bytree = 0.5);
gbdtClassifier.fit(trainingMergedData, classesTrainingSub);
predScoresTraining = gbdtClassifier.predict_proba(trainingMergedData);
fprTrain, tprTrain, thresholdTrain = roc_curve(classesTrainingSub, predScoresTraining[:, 1]);
predScoresTest = gbdtClassifier.predict_proba(testMergedData);
fprTest, tprTest, thresholdTest = roc_curve(classesTest, predScoresTest[:, 1]);
equalsBorder(70);
plt.plot(fprTrain, tprTrain, label = "Train AUC = " + str(auc(fprTrain, tprTrain)));
plt.plot(fprTest, tprTest, label = "Test AUC = " + str(auc(fprTest, tprTest)));
plt.plot([0, 1], [0, 1], 'k-');
plt.xlabel("False Positive Rate");
plt.ylabel("True Positive Rate");
plt.grid();
plt.legend();
plt.show();
areaUnderRocValueTest = auc(fprTest, tprTest);
print("Results of analysis using {} vectorized text features merged with other features using gradient boosting classifier: ".format(technique));
equalsBorder(70);
print("Optimal n_estimators Value: ", optimalHypParamValue);
equalsBorder(40);
print("Optimal max_depth Value: ", optimalHypParam2Value);
equalsBorder(40);
print("AUC value of test data: ", str(areaUnderRocValueTest));
# Predicting classes of test data projects
predictionClassesTest = gbdtClassifier.predict(testMergedData);
equalsBorder(40);
# Printing confusion matrix
confusionMatrix = confusion_matrix(classesTest, predictionClassesTest);
# Creating dataframe for generated confusion matrix
confusionMatrixDataFrame = pd.DataFrame(data = confusionMatrix, index = ['Actual: NO', 'Actual: YES'], columns = ['Predicted: NO', 'Predicted: YES']);
print("Confusion Matrix : ");
equalsBorder(60);
sbrn.heatmap(confusionMatrixDataFrame, annot = True, fmt = 'd', cmap="Greens");
plt.show();
print("Cross-validation curve: ");
print("="*40);
display(Image(filename = "GBDT - Tf-Idf Weighted Word2Vec.png", unconfined = True, width = '400px'));
techniques = ['Bag of words', 'Tf-Idf', 'Average Word2Vec', 'Tf-Idf Weighted Word2Vec', 'Bag of words', 'Tf-Idf', 'Average Word2Vec', 'Tf-Idf Weighted Word2Vec'];
aucValues = [0.7075, 0.7057, 0.6838, 0.6860, 0.7083, 0.7067, 0.6989, 0.7031]
models = ['Random Forests', 'Random Forests', 'Random Forests', 'Random Forests', 'Gradient Boosting - DT', 'Gradient Boosting - DT', 'Gradient Boosting - DT', 'Gradient Boosting - DT'];
nEstimatorsValues = [120, 120, 120, 120, 100, 120, 130, 130]
maxDepthValues = [100, 50, 50, 200, 5, 10, 4, 4]
summaryRows = [{'Vectorizer': technique, 'Model': models[i], 'Max Depth': maxDepthValues[i], 'N Estimators': nEstimatorsValues[i], 'AUC': aucValues[i]} for i, technique in enumerate(techniques)];
randomForestsAndGbdtResultsDataFrame = pd.concat([randomForestsAndGbdtResultsDataFrame, pd.DataFrame(summaryRows)], ignore_index = True);
randomForestsAndGbdtResultsDataFrame
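The merge/tune/evaluate block above is repeated almost verbatim for every vectorizer-model pair. One way to factor that out is a small helper; this is a sketch on synthetic data, and the name `tune_and_evaluate` is hypothetical, not from the notebook:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import GridSearchCV, train_test_split

def tune_and_evaluate(estimator, param_grid, X_train, y_train, X_test, y_test):
    """Grid-search on ROC AUC, refit with the best params, return (best_params, test AUC)."""
    search = GridSearchCV(estimator, param_grid, cv=3, scoring='roc_auc')
    search.fit(X_train, y_train)
    # refit=True (the default) makes best_estimator_ available here.
    scores = search.best_estimator_.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, scores)
    return search.best_params_, auc(fpr, tpr)

# Tiny synthetic demo (hypothetical data, not the DonorsChoose features).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
best, test_auc = tune_and_evaluate(
    RandomForestClassifier(class_weight='balanced', random_state=0),
    {'n_estimators': [10, 20], 'max_depth': [4, 8]},
    X_tr, y_tr, X_te, y_te)
```

Calling this once per (vectorizer, classifier) pair would replace each repeated block with a single line and make the summary table trivial to build from the returned tuples.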